THE SPEECHDAT (M) FRENCH TEXT MATERIAL
1. INTRODUCTION
This paper reports on the construction of the word and
sentence material for the response sheets used in the collection
of the SpeechDat (M) corpus for the French language.
2. THE MATERIAL
The material consists of both prompted and free speech.
The advantage of prompted speech is the wide vocabulary coverage.
Free speech is more natural, but it takes more items to obtain
a good vocabulary coverage. Since speech recognition in voice
response applications is one of the major driving forces behind
the project, a relatively large part of the items are numbers,
times or dates.
The speech material to be collected should be appropriate
to train isolated and/or connected digit recognizers, as well
as sub-word based recognizers.
Three types of responses are prompted:
- 36 read items
- 8 spontaneous answers to printed questions
- 2 spontaneous answers to non-printed questions
The following items were chosen:
Items to be read:
- 1 isolated digit
- 3 connected digits
- 1 money amount
- 2 times (analog and digital)
- 2 dates (analog and digital)
- 3 words
- 3 spelled words (the same words as the 3 spoken
as normal words)
- 9 application words
- 3 application words in a carrier sentence
- 9 phonetically rich sentences
The following list of printed questions is asked:
- 4 yes/no questions
"Are you ready to start?"
"Do you use a cordless phone?"
"Have you lived in another country for a
long period of time?"
"Is French your native language?"
- 4 open questions
"Please, give a number between a thousand
and a million."
"What is your date of birth?"
"In which department did you go to school
for the first time?"
"Please, give your comments on this recording."
The following unprinted questions were prompted (but
the response form advised the subject that a question would be
asked at this point in the recording session):
- "Please, give a time between noon and midnight."
- "Please, give an amount of francs between
a thousand and a million."
The list of items deviates from the SpeechDat(M)
specifications: there is only one natural number instead of three,
and the following items are extra:
- 3 extra application words,
- 1 extra yes/no question,
- 1 comments item,
- 1 extra time item,
- 3 words that are spelled are also read.
3. PHONETICALLY RICH SENTENCES
The SpeechDat (M) corpus for the French language
contains nine phonetically rich sentences per session. These sentences
were taken from a large text corpus that contained many issues
of the newspaper Le Monde. The IDIAP grapheme-phoneme converter
was used to create the most probable transcription of each sentence.
The last stage was to create 1,000 sets of nine sentences,
each sentence appearing in only one set. The sets were formed
in such a way that each one contained at least two tokens of all
phonemes in the French language (except for some very infrequent
phonemes). The list of phonemes (in SAMPA notation) finally used
for coverage purposes follows:
Phoneme Word Transcription
p pont p o~
b bon b o~
t temps t a~
d dans d a~
k quand k a~
g gant g a~
f femme f a m
v vent v a~
s sans s a~
z zone z o n
S champ S a~
Z gens Z a~
j ion j o~ (can be realized as a fricative or as an approximant)
m mont m o~
n nom n o~
J oignon o J o~
N camping k a~ p i N (only found in loan words)
l long l o~
R rond R o~
w coin k w e~
H juin Z H e~
i si s i
e ses s e
E seize s E z
a patte p a t
A pâte p A t
O comme k O m
o gros g R o
u doux d u
y du d y
2 deux d 2
9 neuf n 9 f
@ justement Z y s t @ m a~
e~ vin v e~
a~ vent v a~
o~ bon b o~
9~ brun b R 9~
For more detailed information, see the website http://www.phon.ucl.ac.uk/home/sampa/french.html
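The coverage criterion described above (at least two tokens of each
phoneme per set of nine sentences) can be checked mechanically. The
following is a minimal sketch, not the procedure actually used to
build the sets. The function name and the "exempt" parameter are
hypothetical; the source does not say which infrequent phonemes were
excluded, so none are excluded by default.

```python
from collections import Counter

# Phoneme inventory in SAMPA, copied from the table above (37 symbols).
PHONEMES = ("p b t d k g f v s z S Z j m n J N l R w H "
            "i e E a A O o u y 2 9 @ e~ a~ o~ 9~").split()


def missing_phonemes(transcriptions, required=PHONEMES, min_tokens=2, exempt=()):
    """Return the required phonemes with fewer than min_tokens tokens.

    transcriptions: space-separated SAMPA strings, one per sentence.
    exempt: phonemes left out of the check (hypothetical parameter).
    """
    counts = Counter()
    for transcription in transcriptions:
        counts.update(transcription.split())
    return sorted(p for p in required
                  if p not in exempt and counts[p] < min_tokens)
```

A set of nine sentences satisfies the criterion when the function
returns an empty list for the transcriptions of that set.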
4. DETAILED ANALYSIS OF THE TEXT MATERIAL
4.1. One Isolated Digit
This item is READ.
Digit vocabulary: "un, deux, trois, quatre,
cinq, six, sept, huit, neuf, zéro" (one ... nine, zero)
The ten digits appeared with equal frequency on the
response sheets.
4.2. Three Connected Digits
Applications of digit string recognition are, among
others, telephone number and credit card number recognition. Thus,
the prompts aimed to elicit formulations typical for these types
of applications. All connected digits are READ.
- One 6-digit string: a number expressed as 6 single
digits (READ)
- One 8-digit string, introduced as a telephone number.
French telephone numbers are formulated as 4 2-digit strings;
printed in 4 blocks of 2 numbers each (READ)
- One 16-digit string, introduced as a credit card
number; printed in 4 blocks of 4 numbers each (READ)
Vocabulary:
- digits: "un, deux, ..., zéro"
- 'teens': "onze, douze, treize, quatorze, quinze,
seize, dix-sept, dix-huit, dix-neuf" (eleven ... nineteen)
- 'tys': "dix, vingt, trente, quarante, cinquante,
soixante, soixante-dix, quatre-vingt, quatre-vingt-dix" (ten,
twenty ... ninety)
For telephone and credit card numbers, the syntax
of the strings obeyed the relevant rules (e.g., a subscriber number
cannot start with a 0). Otherwise, care was taken to obtain
equal frequency for all vocabulary items.
4.3. One Natural Number
These data make it possible to construct speech
references for any non-negative number below one million. This
item is SPONTANEOUS, i.e., the speakers were prompted to
say any number between 1,000 and 1,000,000: "S'il vous plaît,
dites un nombre quelconque entre mille et un million" (Please,
say any number between one thousand and a million).
Vocabulary:
- digits
- 'teens'
- 'tys'
- "cent(s), mille, virgule" (hundred(s), thousand,
comma)
Since this item is spontaneous, there is no a priori
attempt to obtain uniform distributions of the vocabulary items.
4.4. Two Money Amounts
The prompts aimed to elicit typical phrases used
with money amounts. Each prompt sheet contained one small money
amount (less than 100 FF), including centimes, and one large
amount. For the centimes, only values of the form x0 and x5
[x = 1, ..., 9] and the value 99 were used, with equal frequency.
The small money amount was READ. The large money
amount was SPONTANEOUS, prompted by the surprise question "Please,
give an amount of francs between a thousand and a million."
Vocabulary:
- digits, 'teens', 'tys'
- "cent(s), mille"
- "franc(s), centime(s)"
4.5. Two Times of Day
There are two methods to report a time: 'digital',
e.g. 23:49, or 'analog', e.g. "sept heures et demie" (half
past seven).
One time was digital, with preference for evening
times. On the session sheet it was represented as e.g. 20:15.
Only times of the form xx:x0 and xx:x5 were prompted. The
other time was printed explicitly as an analog time, including
the words "aujourd'hui" (today) and "demain"
(tomorrow). Both items are READ.
Vocabulary:
- aujourd'hui, demain
- digits 0-25, 30, 35, 40, 45, 50, 55.
- à, matin, après-midi, heure(s), minute(s),
moins, quart, trois (quarts), et, le, demi, midi, minuit, soir,
un(e), ce(t, tte), nuit
4.6. Three Dates
Again there are two ways to specify a date: 'digital',
e.g. 27/12/96, or 'analog', e.g. "Vendredi, premier Mai 1996" (Friday,
the first of May 1996).
- One READ digital date, presented as e.g. 27/12/1995.
All dates were in the interval between 1996 and 2000
- One READ analog date, including weekday
- One question concerning a date, resulting in a
SPONTANEOUS answer: "Quelle est votre date de naissance?"
(What is your date of birth?)
Vocabulary:
- weekdays: lundi, mardi, mercredi, jeudi, vendredi,
samedi, dimanche
- months: janvier, février, mars, avril, mai,
juin, juillet, août, septembre, octobre, novembre, décembre
- ordinal: premier
- cardinal numbers: digits 2-31
4.7. Four Yes/no questions
Four questions that should result in SPONTANEOUS
yes/no responses were asked.
The following questions were printed on the sheet:
"Etes-vous prêt à commencer?"
Confirmation expected (Are you ready to start?)
"Utilisez-vous un téléphone sans
fil?" More negation than confirmation expected (Do you use
a cordless phone?)
"Avez-vous déjà vécu à
l'étranger pendant un long laps de
temps?" Negation expected (Have you lived
abroad for a long period of time?)
"Votre langue maternelle est-elle le français?"
Confirmation expected (Is French your native
language?)
4.8. Three Spelled words
The words to be spelled were selected to approximate a uniform
distribution of all vocabulary letters. The spellings are READ. The
words came from a 500-word inventory extracted from the phonetically
rich material. The number of letters per word ranges from 5 to 11.
Vocabulary:
- A-Z
- apostrophe, accent-aigu, accent-grave, accent-circonflexe,
cédille, tréma, tiret, trait d'union,
lettre capitale, majuscule,
minuscule, deux (e.g. lettre: L E deux T R E)
- lexicon
The callers were first asked to read the words and
then to spell them.
4.9. 9 Application words
Each subject read 9 application words. In addition,
three sentences were included to obtain READ utterances with embedded
keywords. These can be used for word-spotter training. There were
five different carrier phrases per keyword.
The following list gives the application words:
activer
aide
annonce
annuaire
annuler
appeler
arrêter
astérisque
autres
canal
changer
chaîne
composer
conférence
connecter
continuer
date
dièse
désactiver
effacer
en arrière
encore une fois
enregistrer
externe
fin
information
interne
jouer
lecture
menu
messages
modifier
mémoire
nom
nouveau
nouvelle
numéro
opérateur
pause
programmer
précédant
rappeler
rembobiner
retour
répondeur
répéter
station
stop
suite
suivant
transférer
téléphone
verrouiller
écouter
The 3 sentences have been READ.
4.10. Other
The following question was asked to obtain an indication
of the regional/dialectal background of the caller.
"Dans quel département êtes-vous
allé à l'école la première fois?"
(In which department did you go to school for the first time?)
Finally the caller was asked for comments. "S'il
vous plaît, faites-nous part de vos commentaires, remarques
ou impressions." (Please, give your comments, remarks or
impressions).
APPENDIX
APPENDIX 1. DATABASES
This section lists the databases and sections of
those databases which have been used to generate the 40 items
of the single session sheet templates.
1.1. Isolated digits
0 - 9
1.2. Connected digits: 6-, 8-, and 16-digit strings
6-digit strings
7 8 4 5 3 2
9 4 7 6 2 7
1 7 3 6 3 6
6 0 9 3 1 6
...
8-digit strings
64 64 73 16
76 71 38 80
78 02 36 88
82 29 51 44
15 19 32 30
...
16-digit strings
7578 9920 9568 4104
3187 2752 7569 3424
2429 9496 5895 4220
9560 2496 3191 3314
4418 7612 9150 1984
...
1.3. Money amounts
0 F
0 F 05
0 F 10
0 F 15
0 F 20
0 F 25
...
99 F 85
99 F 90
99 F 95
99 F 99
100 F
101 F
102 F
...
9996 F
9997 F
9998 F
9999 F
1.4. Times of day: digital
13 heures
13 heures 05
13 heures 10
13 heures 15
13 heures 20
...
23 heures 50
23 heures 55
0 heure
0 heure 05
1 heure 50
1 heure 55
2 heures
1.5. Times of day: analog
demain matin
demain après-midi
demain à minuit
demain à minuit et quart
demain midi
demain soir
ce midi
ce minuit
ce soir
cet après-midi
ce soir à minuit
cette nuit à minuit et un quart
minuit une
minuit deux
demain, 7 heures (et) 38 (minutes)
aujourd'hui, 23 heures moins 59
1.6. Dates: digital
1/1/1996
...
31/1/1996
...
25/12/1999
31/12/1999
1.7. Dates: analog
lundi, le premier janvier 1996
lundi, 2 janvier 1996
dimanche, le 31 décembre 1999
...
1.8. Spelled words
É V I N C É
É V O L U T I O N
É X O N É R É
 G É E S
Ê T R E S
A É R E R
A É R I E N
A É R O P O R T
Q U A S I M E N T
Q U A T R I É M E
Q U E R E L L E
Z O N A G E
Z O O M S
B U R E A U
C É L É B R E
C É R A M I Q U E
...
1.9. Application words (complete list)
activer
aide
annonce
annuaire
annuler
appeler
arrêter
astérisque
autres
canal
changer
chaîne
composer
conférence
connecter
continuer
date
dièse
désactiver
effacer
en arrière
encore une fois
enregistrer
externe
fin
information
interne
jouer
lecture
menu
messages
modifier
mémoire
nom
nouveau
nouvelle
numéro
opérateur
pause
programmer
précédant
rappeler
rembobiner
retour
répondeur
répéter
station
stop
suite
suivant
transférer
téléphone
verrouiller
écouter
1.10. Phonetically rich sentences
Le Monde:
ce projet nécessite la construction d'une
énorme écluse.
ce projet est accompagnée d'un projet immobilier
correspondant à 4,500 habitants.
il est actuellement étudié par trois
architectes locaux.
les sondages archéologiques sont terminés
depuis fin janvier.
une société d'économie mixte
sera crée fin janvier.
les premiers coups de pioche devraient être
donées à la fin de l'année.
une première tranche sera commercialisée
en 1992.
à Montpellier, la dimension du projet interdit
toute échance précise.
deux ans après, on a annoncé en grande
pompe la création de Port-Mariane.
ce port fluvial se situe au coeur d'un nouveau quartier
de 20,000 habitants à l'est de la ville.
ce nouveau quartier est aéré par trois
espaces verts.
seule apparait comme définitive aujourd'hui
le debut de la tion de la future mairie.
...
...
1.11. Application words embedded in carrier sentences
il renonce à prendre sa retraite à cette date.
vous effacer la prochain chanson sur la cassette.
effacez les erreurs présentes dans ca texte.
cette démarche novatrice a mis fin à ces pratiques dépassés.
le Parc national de Cévennes a apportée une aide technique.
le lundi suivant, le volume des transactions doublait.
le président de la République leur a fait parvenir un message.
...
...
DESCRIPTION OF THE STRUCTURE OF THE CORPUS AS IT APPEARS ON THE CD-ROMS
[There is a file which gives the structure of the
corpus as it appears on the CD-ROMs].
1. NUMBER OF SESSIONS
A total of 1,302 sessions (calls) were recorded and processed.
Of these, 1,000 sessions are on the CD-ROMs for SpeechDat(M).
The remaining 302 sessions were transferred separately to Philips
Research Aachen.
2. DIVISION OF THE SESSIONS
The SpeechDat(M) corpus for the French language contains
1,000 sessions. It is delivered on three (3) CD-ROMs.
The phonetically rich sentences that occur within
each session are put on a separate CD-ROM. The other two CD-ROMs
contain the remaining items.
3. COMPRESSION AND STRUCTURE
The data on the CD-ROMs are stored as gzipped, A-law
coded waveforms. Each waveform file contains a single item.
In accordance with the ESPRIT Project SAM standards, the files
contain no header. Instead, each signal file is accompanied
by a label file in which the relevant information is stored.
The complete corpus consists of 3 CD-ROMs. The speakers
are stored in arbitrary order. Male and female speakers are not
stored in separate directory trees.
4. DIRECTORY STRUCTURE
Blocks of 100 sessions (directories) are created
to ensure the manageability of the large number of files. Each
session contains all speech files of a single speaker.
The final directory structure of the CD-ROM has four
levels:
\<database_name>\<volume>\<block>\<session>\<file>
<database_name> is defined as <name><#><language
code>, where <name> can be FIXED, MOBIL or VERIF, anticipating
SpeechDat main phase requirements for fixed, mobile and speaker
verification databases. <#> is 0 for SpeechDat(M), 1 for SpeechDat main
phase. <language code> is the ISO two-letter code for the
language that is recorded. In the case of this SpeechDat(M) French
corpus the database name is FIXED0FR.
<volume> is defined as CD<vv>, where
<vv> is a progressive number from 00 to 02, specifying the
physical CD-ROM containing the material
<block> is defined as BLOCK<nn> where
<nn> is a progressive number from 00 to 99. One block contains
100 calls.
<session> is defined as SES<nnxx> where
<nn> is the same as in block and <xx> is a number
between 00 and 99. This is the numeric call sequence identification
number, which is also encoded in each filename. As there are no
more than 50 utterances per call, the total number of speech files
and associated transcription files does not exceed the CD-ROM
recommended limit of approximately 100 files in a directory.
<file> is defined as 'A0nnxxcc.FRf', where A0 stands for SpeechDat(M), fixed.
<nnxx> is the same as in session.
<cc> is the item code. It is a unique code for each item:
A1-A9: 9 application words out of a vocabulary of 54 words
C1: 6 digit ID number
C2: 8 digit telephone number
C3: 16 digit credit card number
D1-D3: 3 dates
E1-E3: 3 carrier sentences with application words
I1: isolated digit
L1-L3: 3 spelled words
M1: large money amount
M2: small money amount
N1: natural number
P1: place name
Q1-Q4: 4 yes/no questions
R1: comments
S1-S9: 9 phonetically rich sentences
T1-T3: 3 times
W1-W3: words, the same as the spelled words, but
now spoken as normal words
<f> is O for Orthographic, A for A-law and
Z for compressed. In the case of this corpus all speech files
are compressed using 'gzip', thus having the Z at the end of each
file name. When uncompressing the files the Z should be changed
into an A.
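The naming scheme above can be unpacked programmatically. The sketch
below is illustrative only: the function names are hypothetical, and
the volume number has to be supplied by the caller because it is not
encoded in the file name (the phonetically rich sentences of a session
sit on a different CD-ROM than its other items).

```python
import re

# 'A0' + block <nn> + call <xx> + item code <cc> + '.FR' + type <f>
FILE_RE = re.compile(
    r"^A0(?P<block>\d{2})(?P<call>\d{2})(?P<item>[A-Z]\d)\.FR(?P<kind>[OAZ])$"
)


def parse_name(filename):
    """Split a name such as 'A00864A4.FRZ' into its components."""
    m = FILE_RE.match(filename.upper())
    if m is None:
        raise ValueError(f"not a SpeechDat(M) French file name: {filename!r}")
    return {
        "session": m["block"] + m["call"],  # <nnxx>
        "block": m["block"],                # <nn>
        "item": m["item"],                  # <cc>
        "kind": m["kind"],                  # <f>: O, A or Z
    }


def session_dir(filename, volume, database="FIXED0FR"):
    """Build the directory path for a file, given the CD volume number."""
    info = parse_name(filename)
    parts = ["", database, f"CD{volume:02d}",
             f"BLOCK{info['block']}", f"SES{info['session']}"]
    return "\\".join(parts)


def uncompressed_name(filename):
    """Name to use after gunzipping: the final Z becomes an A."""
    if parse_name(filename)["kind"] != "Z":
        raise ValueError("file is not marked as compressed")
    return filename[:-1] + "A"
```

For the file A00864A4.FRZ on volume 1, session_dir yields
\FIXED0FR\CD01\BLOCK08\SES0864, and uncompressed_name yields
A00864A4.FRA.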
5. LABEL FILE
Each (compressed) speech data file is accompanied
by a label file. It contains information about recording and
speaker conditions of the speech data file. The label file is
in SAM format. An example can be found below:
LHD: SAM, 5.00
DBN: SpeechDat(M)_French
VOL: FIXED0FR_01
SES: 0864
SHT: 0969
REG:
SEX: F
AGE: 21
DIR: \FIXED0FR\CD01\BLOCK08\SES0864
SRC: A00864A4.FRZ
CCD: A4
REP: SPEX, LEIDSCHENDAM, THE NETHERLANDS
RED: 23/Jun/1995
RET: 21:36:28
ASS: OK
BEG: 0
END: 32511
SAM: 8000
SNB: 1
SBF:
SSB: 8
QNT: A-LAW
CMP: GZIP 1.2.4
LBD:
LBR: 0, 32511, , , , enregistrer
LBO: 0, 16256, 32511, [bip] enregistrer
ELF:
This example concerns the 4th application word (item A4), read from
sheet no. 0969 in session 0864, to be found on CD01.
The meaning of the mnemonics is defined and explained
in deliverable LRE 63314 D 1.4.1, which can be obtained from the WWW site
http://www.phonetik.uni-muenchen.de/SpeechDat.html
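A label file of this form is a sequence of "mnemonic: value" lines
and can be read with a few lines of code. The sketch below is a
minimal reader, not a full SAM parser. The function names are
hypothetical, and the duration computation assumes that BEG and END
are inclusive sample indices and that SAM is the sampling frequency
in Hz, as the example suggests; the deliverable cited above is the
authority on those fields.

```python
def parse_label(text):
    """Map each label mnemonic (e.g. 'SES') to its value string."""
    fields = {}
    for line in text.splitlines():
        # Split on the first colon only, so 'RET: 21:36:28' keeps its time.
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def duration_seconds(fields):
    """Item duration, assuming BEG/END are inclusive sample indices
    and SAM is the sampling frequency in Hz (assumptions)."""
    n_samples = int(fields["END"]) - int(fields["BEG"]) + 1
    return n_samples / int(fields["SAM"])
```

Under these assumptions the example above (BEG 0, END 32511, SAM 8000)
corresponds to 32512 samples, i.e. about 4.06 seconds of signal.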
6. SPEAKER.SAM
The file SPEAKER.SAM in \FIXED0FR\TABLE contains
information about each speaker: the session ID corresponding
to the speaker, the age and sex of the speaker, the mother
tongue of the speaker (French or not) and the region where
the speaker went to school for the first time.
An example:
SES: 0000
SEX: M
AGE: 30
NLN: FRENCH
ACC: Picardie
Note: as key for each speaker, we used SES instead
of SCD (speaker code).
7. CONTENTS.LST
The same directory, \FIXED0FR\TABLE, holds the
file CONTENTS.LST, which contains information about each file on
the CDs. All of this information is also present in the label file
that accompanies each speech file. The following fields are present,
with the corresponding label file mnemonic in parentheses:
CDROM volume name (VOL:)
full pathname (DIR:)
speech file name (SRC:)
speaker code (SCD:)
speaker sex (SEX:)
speaker age (AGE:)
assessment (ASS:)
orthographic transcription of the uttered item (LBO:)
The seventh field, assessment, takes the place of region
(REG:), because the region of the call is unknown.
An example:
FIXED0FR_00 \FIXED0FR\CD00\BLOCK00\SES0000 A00000C1.FRZ
0000 M 30 NOISE [bruit/] [bouche] cinq deux
trois deux sept cinq [/bruit]
FIXED0FR_00 \FIXED0FR\CD00\BLOCK00\SES0000 A00000Q1.FRZ
0000 M 30 OK oui [bruit]
FIXED0FR_00 \FIXED0FR\CD00\BLOCK00\SES0000 A00000Q2.FRZ
0000 M 30 OTHER [bouche] non non
FIXED0FR_00 \FIXED0FR\CD00\BLOCK00\SES0000 A00000Q3.FRZ
0000 M 30 OK [bouche] non pas vraiment
FIXED0FR_00 \FIXED0FR\CD00\BLOCK00\SES0000 A00000D1.FRZ
0000 M 30 OK [bouche] le dix-neuf juin:
soixante quatre [bruit]
FIXED0FR_00 \FIXED0FR\CD00\BLOCK00\SES0000 A00000P1.FRZ
0000 M 30 OK [bouche] le Nord [bruit]
FIXED0FR_00 \FIXED0FR\CD00\BLOCK00\SES0000 A00000Q4.FRZ
0000 M 30 OK oui [bruit]
FIXED0FR_00 \FIXED0FR\CD00\BLOCK00\SES0000 A00000C2.FRZ
0000 M 30 OK [bouche] trente-deux cinquante-neuf
trente-deux cinquante-six
FIXED0FR_02 \FIXED0FR\CD02\BLOCK00\SES0000 A00000S1.FRZ
0000 M 30 OK [bouche] ils sont pourtant
partis en même temps de la même nébuleuse [bruit]
FIXED0FR_02 \FIXED0FR\CD02\BLOCK00\SES0000 A00000S2.FRZ
0000 M 30 OK [bouche] l'irascible new-yorkais
a en effet une nouvelle fois cédé le vingt et un
janvier à son tempérament colérique [bouche]
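The eight fields can be separated as sketched below. This is an
illustration under two assumptions that the text above does not
confirm: that each entry occupies a single line in the actual file
(the entries above are wrapped for display), and that the fields are
separated by whitespace, with the transcription taking up the rest of
the line. The class and function names are hypothetical.

```python
from typing import NamedTuple


class ContentsEntry(NamedTuple):
    volume: str         # VOL
    directory: str      # DIR
    filename: str       # SRC
    speaker: str        # SCD
    sex: str            # SEX
    age: str            # AGE
    assessment: str     # ASS
    transcription: str  # LBO


def parse_contents_line(line):
    """Split one CONTENTS.LST entry into its eight fields."""
    parts = line.split(None, 7)  # at most 7 splits: the rest is the transcription
    if len(parts) < 7:
        raise ValueError(f"too few fields: {line!r}")
    # An empty transcription is allowed; internal whitespace is normalised.
    transcription = " ".join(parts[7].split()) if len(parts) == 8 else ""
    return ContentsEntry(*parts[:7], transcription)
```

Entries can then be filtered on the assessment field, for example to
keep only the items marked OK.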
8. HANDSET.LST
The directory \FIXED0FR\TABLE also holds the
file HANDSET.LST, which gives the type of handset used in each
session. This is either 'CORDED' or 'CORDLESS'.
9. LEXICON.TBL
The file LEXICON.TBL, also to be found in \FIXED0FR\TABLE
contains the phonemic representation (in SAMPA) of each word form
on the CDs. An extract:
demande 57 d @ m a~ d
demandent 5 d @ m a~ d
demander 16 d @ m a~ d e
demanderaient 4 d @ m a~ d @ R &/
demanderons 1 d @ m a~ d @ R o~
demanderont 1 d @ m a~ d @ R o~
demandes 22 d @ m a~ d
demandeurs 7 d @ m a~ d 9 R
demandez 2 d @ m a~ d &/
demandiez 1 d @ m a~ d i E/
demandons 10 d @ m a~ d o~
demandé 27 d @ m a~ d &/
The layout is word, frequency, phonemic representation,
separated by tabs. The symbols of the phonemic representation
are separated by spaces.
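Given this layout, a lexicon line can be split as in the sketch
below. The function names are hypothetical, and the sketch keeps a
list of entries per word form in case a form occurs on more than one
line, which the description above neither states nor rules out.

```python
def parse_lexicon_line(line):
    """Return (word, frequency, phoneme list) for one tab-separated line."""
    word, frequency, phonemes = line.rstrip("\r\n").split("\t")
    return word, int(frequency), phonemes.split()


def load_lexicon(lines):
    """Collect the entries per word form; blank lines are skipped."""
    lexicon = {}
    for line in lines:
        if line.strip():
            word, frequency, phonemes = parse_lexicon_line(line)
            lexicon.setdefault(word, []).append((frequency, phonemes))
    return lexicon
```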
10. SUMMARY.TXT
The file SUMMARY.TXT, to be found in \FIXED0FR\DOC,
lists all the items on the CD. Whereas CONTENTS.LST
covers all items on all CDs, SUMMARY.TXT covers only the items
that are on one specific CD. The format of SUMMARY.TXT is: full
path name, session number, items, recording date, recording time,
all separated by tabs. The items that are present are printed in
one string; if an item is not present, two hyphens ('--') are
printed in its place.
An example:
\FIXED0FR\CD00\BLOCK00\SES0000 0000 A1A2A3A4A5A6A7A8A9C1C2C3D1D2D3E1E2E3I1L1L2L3M1M2N1P1Q1Q2Q3Q4R1T1T2T3W1W2W3
26/Apr/1995 11:34:16
\FIXED0FR\CD00\BLOCK00\SES0001 0001 A1A2A3A4A5A6A7A8A9C1C2C3D1D2D3E1E2E3I1L1L2L3M1M2N1P1Q1Q2Q3Q4R1T1T2T3W1W2W3
27/Apr/1995 15:09:44
\FIXED0FR\CD00\BLOCK00\SES0002 0002 A1A2A3A4A5A6A7A8A9C1C2C3D1D2D3E1E2E3I1L1L2L3M1M2N1P1Q1Q2Q3Q4R1T1T2T3W1W2W3
27/Apr/1995 15:23:46
\FIXED0FR\CD00\BLOCK00\SES0003 0003 A1A2A3A4A5A6A7A8A9C1C2C3D1D2D3E1E2E3I1L1L2L3M1M2N1P1Q1Q2Q3Q4R1T1T2T3W1W2W3
28/Apr/1995 14:31:06
\FIXED0FR\CD00\BLOCK00\SES0004 0004 A1A2A3A4A5A6A7A8A9C1C2C3D1D2D3E1E2E3I1L1L2L3M1M2N1P1Q1Q2Q3Q4R1T1T2T3W1W2W3
01/May/1995 22:49:26
\FIXED0FR\CD00\BLOCK00\SES0005 0005 A1A2A3A4A5A6A7A8A9C1C2C3D1D2D3E1E2E3I1L1L2L3M1M2N1P1Q1Q2Q3Q4R1T1T2T3W1W2W3
04/May/1995 12:12:22
\
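Because every item code is two characters long and a missing item is
replaced by two hyphens in the same position, the missing items of a
session can be recovered from the items string. The sketch below
assumes the item order shown in the example above (which applies to
the CDs without the phonetically rich sentences); the names are
hypothetical.

```python
# Item order as it appears in the example above (37 codes, no S1-S9).
EXPECTED = ("A1 A2 A3 A4 A5 A6 A7 A8 A9 C1 C2 C3 D1 D2 D3 E1 E2 E3 "
            "I1 L1 L2 L3 M1 M2 N1 P1 Q1 Q2 Q3 Q4 R1 T1 T2 T3 W1 W2 W3").split()


def split_items(item_string):
    """Cut the items string into consecutive two-character codes."""
    return [item_string[i:i + 2] for i in range(0, len(item_string), 2)]


def missing_items(item_string, expected=EXPECTED):
    """Return the expected codes whose position holds '--'."""
    codes = split_items(item_string)
    if len(codes) != len(expected):
        raise ValueError("items string does not match the expected item order")
    return [want for want, got in zip(expected, codes) if got == "--"]
```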
11. MISSING.TXT
The directory \FIXED0FR also holds the file MISSING.TXT, which lists all missing items.
RECORDING PLATFORM: FUNCTIONAL DESIGN
1. INTRODUCTION
The recordings of the French SpeechDat(M) corpus
were made on the same recording platform as was used for the Dutch
Polyphone corpus. For the French recordings an international green
number was rented. The lines enter the Netherlands digitally and
they remain digital. The recordings were made on an OS/2 based
PC. The application was developed by means of the "Show-'N-Tel"
application generator. The underlying hardware is a combination
of a Rhetorex (RDSP16000) voice board and an ACULAB (1TR6 ISDN-30)
telephone interface.
Apart from this OS/2 based recording platform, a
UNIX computer is used for permanent storage of sampled data. This
environment provides 16 independent input lines, from which calls
can be recorded and stored.
This report defines the recording process for SpeechDat
(M) French. Specific information about the hardware can be found
in appendix A of this document.
2. OVERVIEW OF THE RECORDING PROCESS
Three different phases in the recording process of
individual calls may be discerned:
1. Initial setup: During the initial setup, a unique
session-ID is generated, the available space on the platform is
checked and the directories are created, which must be available
for the succeeding phases.
2. Recording: During the recording phase, the prompts
are produced and the tokens are recorded.
3. Transfer to permanent storage: During the transfer
to permanent storage, a set of files corresponding to exactly
one speaker is transferred from PC to a UNIX machine.
These actions are preceded by a one-time preparation
phase, in which the recording protocol is defined, the system
prompts are recorded, the speech material is defined, subjects
are recruited, etc.
One of the features of the platform is that it has
an ISDN-30 telephone interface and, thanks to the voice board,
it has the capacity to handle 16 recording sessions simultaneously.
A recording session starts with an incoming call on one of the
30 time slots of the telephone interface. If fewer than 16 lines
are active, the call is answered. If all 16 lines are already
active, the call is not answered and the caller gets a busy tone.
If the call is answered, the available disk space on the recording
platform is checked. If the amount of disk space is under a specified
minimum, an appropriate message is played to the caller and the
system disconnects the call. For each call that is accepted, a
unique, 6-digit session-ID is generated. This is done by a separate
program, written in C and compiled, that is called from the "Show-'N-Tel"
script. A directory for this particular session is then created.
After this initialization, a short introductory message
is played, followed by a sequence of prompts for utterances to
be recorded. To do this, the program first reads a database to
determine the token type for the token to be recorded. This database
is actually a text file, containing two lines for each token to be recorded.
The first line holds the token-ID, a two-digit
string, while the second line contains one of the terms "word",
"sentence" or "longsentence". This information is used
to differentiate between the recording of (short) words or (longer)
sentences.
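The token type database described above is simple enough to read in a
few lines. The following sketch is an illustration in Python, not the
code of the platform (which runs from a "Show-'N-Tel" script); the
function name is hypothetical and the handling of blank lines is an
assumption.

```python
VALID_TYPES = {"word", "sentence", "longsentence"}


def parse_token_types(text):
    """Map each two-digit token-ID to its token type.

    The text holds two lines per token: the token-ID, then the type.
    Blank lines are ignored (an assumption about the file layout).
    """
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if len(lines) % 2 != 0:
        raise ValueError("expected two lines per token")
    table = {}
    for token_id, kind in zip(lines[0::2], lines[1::2]):
        if len(token_id) != 2 or not token_id.isdigit():
            raise ValueError(f"bad token-ID: {token_id!r}")
        if kind not in VALID_TYPES:
            raise ValueError(f"bad token type: {kind!r}")
        table[token_id] = kind
    return table
```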
The system messages are all stored in files in a
single directory. There are guidance messages, such as introductory
phrases, and prompts, intended to elicit a vocal response from
the speaker. The prompts are all stored in files, with names corresponding
to the token-ID. The system messages are spoken by two different
voices. The prompts for utterances are spoken by a female speaker.
The system messages that are used as guidance messages are spoken
by a male speaker.
The recorded vocal responses of the callers are stored
in files with names 'token-id.smp'. After the last utterance has
been recorded, a short message is played and the connection is
closed. After this hangup, a file is created with the name: 'session-id.ses'.
A background job periodically detects the existence of this file
and figures out the session-id from the filename. The files that
correspond to this session are then transferred for permanent
storage. This transfer is done via ethernet using NFS (for OS/2).
3. INTERACTION BETWEEN CALLERS AND THE RECORDING PLATFORM
The core of the application lies in the prompting
for utterances and the actual recording of those utterances. This
section describes several details regarding this aspect of the
application.
* Recording is terminated either by a time-out, which
occurs if the maximum duration for a vocal response is exceeded,
or by silence detection. If the maximum duration is reached,
the recording is skipped, so in practice all recordings are terminated
by silence detection. For this purpose a silence interval is specified.
If a silence interval of this duration is detected, recording
is terminated. If the resulting speech file is no
larger than the silence interval, the file is discarded
and no speech is considered to have been detected. If the file is
larger, an utterance is taken to have been recorded (minimum duration
is 0.125 seconds).
* Utterances are divided into two categories: words
and sentences. Words have a corresponding silence interval of
1.92 seconds. Sentences have an interval of 2.56 seconds. The
maximum duration for a vocal response is for words 12 seconds,
for sentences 30 seconds.
* Immediately after a prompt has been played, the
recording starts. The silence before the utterance is retained.
The maximum duration of this initial silence is determined by the
recording silence interval: 1.92 and 2.56 seconds for words and
sentences respectively. If
the caller does not respond to a prompt within the duration of
the silence interval, a message is played and the prompt is repeated.
If the caller still does not respond to the prompt, a message
is played followed by the next prompt. If three tokens in a row
have been skipped, a message is played and the session is aborted.
* It is not possible to detect if a caller speaks
too early. Therefore, no specific action is taken.
* Recording an utterance continues until a period
of silence has been detected. The duration of this period of silence
depends on the type of utterance to be recorded: word or sentence.
If a recording exceeds the maximum duration for this particular
type of utterance, the recording is ended, a message is played
and the prompt is repeated. If the caller responds with a new
utterance which exceeds the allotted time, another message is
played, followed by the next prompt. The recording though is retained.
If three consecutive prompts are answered with utterances exceeding
the allowed time frame, a message is played and the session is
aborted.
Because the silence detection in the recording platform
was upset by the DC offset in the telephone connections and background
noise when subjects called from noisy factory floors, a large
number of calls were aborted too early.
Therefore, it was decided to adapt the recording
protocol to these conditions. The new protocol took effect as
of June 6, 1995. The average duration of the individual items was
computed from 390 complete sessions. As long as silence detection
appeared to work properly, the original protocol was followed.
However, as soon as end-of-utterance detection had failed, the recording
platform switched to the alternative protocol, in which recording
continued for the average duration of the item (based on the
first 390 calls) plus the 'silence' duration for the type of stimulus
(1.92 seconds for words and 2.56 seconds for sentences).
When the new protocol took effect, a new assessment
value PLATFORM was added; it was given to those items for which
end-of-utterance detection did not work, but that did not contain
disturbing background noise.
* If a caller hangs up before the final item has
been recorded, the session is aborted.
* After the final utterance has been recorded, a
message is played and the connection is terminated.
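The timing rules in this section can be summarised in a small sketch.
It is an illustration only, not the platform's script: all names are
hypothetical, and the abort rule assumes that only three consecutive
exceptions of the same kind end a session, since the text does not say
how mixed runs of silence and over-long responses are treated. The
"longsentence" type is left out because its timing values are not
given.

```python
# Values in seconds, taken from the text above.
SILENCE_INTERVAL = {"word": 1.92, "sentence": 2.56}
MAX_DURATION = {"word": 12.0, "sentence": 30.0}


def classify(duration, kind, speech_detected):
    """Label one recording attempt as 'silence', 'babble' or 'ok'."""
    if not speech_detected:
        return "silence"          # caller did not respond
    if duration > MAX_DURATION[kind]:
        return "babble"           # response exceeded the allotted time
    return "ok"


def fallback_recording_time(average_item_duration, kind):
    """Fixed recording time used once end-of-utterance detection has
    failed: average item duration plus the silence interval."""
    return average_item_duration + SILENCE_INTERVAL[kind]


def should_abort(outcomes, limit=3):
    """True if the last `limit` outcomes are the same exception
    (assumption: mixed runs do not count)."""
    tail = outcomes[-limit:]
    return (len(tail) == limit
            and tail[0] in ("silence", "babble")
            and all(outcome == tail[0] for outcome in tail))
```

For a word whose average duration over the reference sessions was 2.0
seconds, the fallback recording time under these rules would be 3.92
seconds.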
4. EXCEPTION HANDLING IN THE SCRIPT
The script itself is very straightforward. Few things
can go wrong. No branches are present, except those corresponding
to exceptions. The exceptions in the interaction with the caller
mentioned above are identified in the script as follows:
- Silence: A caller does not respond to a prompt.
- Babble: A caller's response takes up more than the allotted time.
- Hangup: A caller disconnects, before all utterances have been recorded.
- Diskfull: The disk of the recording platform is full at the start of a session.
5. SAMPLED DATA FILE FORMAT AND FILE NAMES
Each utterance is stored as a sampled data file.
The sampled data files contain 8-bit A-law coded PCM samples,
sampled at 8000 Hz, i.e. 64 kbit/s. The sampled data files contain
no header and the names are derived from the prompt-ID. Prompts
1 through 50 result in files "1.smp" through "50.smp",
and the files are stored in a directory denoted by a session-ID.
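Since the files are headerless, with one 8-bit A-law sample per byte
at 8000 Hz, they can be expanded to linear PCM without any further
metadata. The sketch below applies the standard ITU-T G.711 A-law
expansion, scaled to a 16-bit range; it is a generic decoder, not
code from the recording platform, and the function names are
hypothetical.

```python
SAMPLE_RATE = 8000  # Hz; one 8-bit A-law sample per byte


def alaw_to_linear(byte):
    """Expand one A-law byte to a signed linear sample (G.711)."""
    byte ^= 0x55                      # undo the even-bit inversion
    value = (byte & 0x0F) << 4        # 4-bit mantissa
    segment = (byte & 0x70) >> 4      # 3-bit segment number
    if segment == 0:
        value += 8
    else:
        value = (value + 0x108) << (segment - 1)
    return value if byte & 0x80 else -value  # sign bit set means positive


def decode_alaw(data):
    """Decode a bytes object of A-law samples to a list of integers."""
    return [alaw_to_linear(b) for b in data]


def duration_seconds(data):
    """Duration of a headerless file: one byte per sample at 8000 Hz."""
    return len(data) / SAMPLE_RATE
```

The decoded values lie between -32256 and 32256, so they fit in 16-bit
signed samples.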
APPENDIX A: INSTALLED HARDWARE AND SOFTWARE ON RECORDING PLATFORM
A.1. The ISDN-connection
The recording platform for SpeechDat (M) French is
based on an ISDN-30 connection with 1TR6 signalling (i.e. a primary
rate German ISDN, 2 Mbit/s connection). The Dutch PSTN infrastructure
guarantees that a speech signal with an ISDN connection as its
destination remains in an A-law coded digital form after the first
major network switch that it encounters. All 30 lines of the connection
can be reached by one telephone number.
The Aculab MVIP/PEB E1/G703 PC card provides a means
of connection between the telephone network and various kinds
of PC based speech and data processing cards. The E1 card can
handle up to 30 separate calls at one time. Processors on the
E1 card control all of the call signalling (call setup, call acceptance,
call clearing etc.) in response to commands from an application
program running on the host computer. One information element
that is available to the application is the DDI number, i.e. the
number used by the caller to call into the card.
Audio signals transmitted over the E1 card must be
encoded using CCITT A-law PCM. Connection
between the E1 card and the ISDN network termination port is via
the Line Interface Unit, which provides high voltage isolation
and EMC protection.
A.2. The Voice Processing Platform
The Rhetorex RDSP/16000 Voice Processing Platform
is a sixteen telephone port single slot voice processing board
for digital line interfaces. It can be installed in IBM PC/AT
or ISA bus compatible computers and must be connected to the digital
telephony interface card via the MVIP bus. Also included is a
512 channel non-blocking switch matrix that provides Enhanced
Compliant MVIP switching. All of the available channels can be
routed to the MVIP bus for transfer to the Rhetorex RDSP card,
as well as other MVIP resource cards, such as voice recognition,
FAX, or subscriber line cards. The matrix is also capable of acting
as a central switch by switching data from one MVIP card to another.
A.3. Configuration
The platform is based on a PC, type Compaq Prolinea
4/33i, 33 MHz, 250 MB hard disk, 16 MB RAM. The operating system
installed on the PC is OS/2 version 2.1.
The fully installed PC contains three add-on ISA
boards. These three boards are:
1. 1TR6-card, ACULAB: I/O-address: 380
base address dual ported memory: 0xD000:0000 (64KBytes)
IRQ: 5
2. RDSP16000-voicecard, Rhetorex: I/O-address: 390
base address dual ported memory: 0xC800:0000 (4KBytes)
IRQ: 3
3. ETHERNET-card, RACAL-MILGO: I/O-address: 300
base address dual ported memory: 0xCC00:0000 (16KBytes)
IRQ: 4
A.4. Installed software
The following software is installed on the platform:
- OS/2, version 2.1
- Rhetorex drivers for voice board
- ACULAB drivers for 1TR6 card
- Show-'N-Tel application generating software
- Borland C for OS/2
- Perl programming language for OS/2
- TCP-IP with NFS for OS/2