
A general study of evaluation methods, measurement and related projects through different language technologies. 


Speech Synthesis, also often referred to as Text-To-Speech (TTS) processing, consists in converting written input into spoken output by automatically generating synthetic speech.

TTS systems generally consist of 3 modules:

  1. Text Processing,
  2. Prosody Generation,
  3. Acoustic Synthesis.


Text Processing

The first step in a TTS system is text processing. The input text is analyzed and transformed into a linguistic representation containing all the information needed in the subsequent TTS steps. Typical text processing operations are:

  • Special words or symbols (numbers, acronyms, abbreviations, etc.) are identified in the input text and normalized (usually expanded in full text form).
  • Each word in the input text is assigned a part-of-speech category (a POS tag) that determines its grammatical function.
  • A phonetic lexicon and a set of rules are used to produce the appropriate phonetic transcription of the input text (Grapheme-to-phoneme conversion).


Prosody Generation

Prosody is the set of speech features that allow the same phonetic sound to be uttered in very different ways. These features include intonation (tone, pitch contour), speech rate, segment duration, phrase breaks, stress level and voice quality. Prosody plays a fundamental role in conveying meaning, attitude and intention, and in producing natural speech.
The objective of the prosodic TTS module is to generate prosodic features that will make the intonation of the final synthesized speech as close as possible to that of a natural human voice. In most TTS applications it is essential to produce expressive speech.

Acoustic Synthesis

The acoustic module physically generates the final speech signal (the synthesized voice) by implementing the appropriate sequence of phonetic units and the desired prosodic features resulting from the previous, afore-mentioned processing steps.


A first approach is to evaluate separately the components of these different modules (glass box evaluation):

  • Evaluation of the Text Processing components,
  • Evaluation of the Prosody Generation module,
  • Evaluation of the Acoustic Synthesis module.

Another (complementary) approach consists in measuring the overall quality of the synthesized speech (black box evaluation).

TTS evaluation campaigns generally combine both approaches to investigate all objective and subjective aspects of speech synthesis technologies.

The complexity of TTS evaluation comes from the fact that it consists of separate evaluation tasks, each requiring a specific protocol and test collection.

In addition, other specific methods are required to evaluate other TTS-related research tasks (voice conversion, expressive speech synthesis, etc.).


Objective Evaluation

The evaluation of the text processing components is done through automatic metrics (objective measures) by comparing the outputs with a reference:

  • Normalization of Non-Standard-Words (NSWs): Word Error Rate (percentage of words not correctly disambiguated);
  • End-of-Sentence Detection: Sentence Error Rate (percentage of sentences not correctly segmented);
  • POS Tagging: POS-tag Error Rate (percentage of incorrect tags);
  • Grapheme-to-Phoneme Conversion: Phoneme Error Rate (percentage of erroneous phonemes) and Word Error Rate (percentage of words containing at least one erroneous phoneme).
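
As an illustration of the grapheme-to-phoneme measures above, the following minimal sketch computes a Phoneme Error Rate and a Word Error Rate from a reference and a system transcription. It assumes that phoneme errors are counted with a Levenshtein (edit) distance and that the two transcriptions are aligned word by word; the function names and the toy data are hypothetical and not taken from any particular evaluation campaign.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences
    (substitutions, insertions and deletions all cost 1)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def g2p_error_rates(references, hypotheses):
    """Return (phoneme_error_rate, word_error_rate) in percent.

    references / hypotheses: one phoneme list per word, aligned by position.
    A word counts as wrong if it contains at least one erroneous phoneme.
    """
    phoneme_errors = sum(edit_distance(r, h)
                         for r, h in zip(references, hypotheses))
    total_phonemes = sum(len(r) for r in references)
    wrong_words = sum(1 for r, h in zip(references, hypotheses) if r != h)
    per = 100.0 * phoneme_errors / total_phonemes
    wer = 100.0 * wrong_words / len(references)
    return per, wer


# Toy data invented for illustration: three words, one with a wrong vowel.
reference = [["k", "ae", "t"], ["d", "ao", "g"], ["f", "ih", "sh"]]
hypothesis = [["k", "ae", "t"], ["d", "aa", "g"], ["f", "ih", "sh"]]
per, wer = g2p_error_rates(reference, hypothesis)
print(f"PER = {per:.1f}%, WER = {wer:.1f}%")
```

The same pattern (count mismatches against a reference, divide by the reference size) applies to the other error rates in the list, with words, sentences or POS tags in place of phonemes.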


Subjective Listening Tests

The global (black-box) evaluation and the evaluation of the other modules (Prosody and Acoustic Synthesis) mainly rely on subjective tests conducted by human judges.

A typical subjective evaluation procedure is as follows:

  • Test sentences (input text) are processed by the system.
  • Resulting synthesized speech excerpts are collected.
  • Subjective judgment tests are performed by human listeners.

Subjects are asked to rate the quality of the synthesized sentences they listen to, according to a series of pre-defined criteria (naturalness, intelligibility, pleasantness, etc.). The TTS systems or modules under scrutiny are compared based on these scores.
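
A minimal sketch of how such listener ratings might be aggregated is given below. It assumes that each judgment is a numeric score (here on a 1 to 5 scale, which is an assumption and not specified above) and that systems are compared on the mean score per criterion; the system names and ratings are invented for illustration.

```python
from collections import defaultdict
from statistics import mean


def mean_scores(ratings):
    """Average the ratings for each (system, criterion) pair.

    ratings: iterable of (system, criterion, listener, score) tuples.
    """
    buckets = defaultdict(list)
    for system, criterion, _listener, score in ratings:
        buckets[(system, criterion)].append(score)
    return {key: mean(values) for key, values in buckets.items()}


# Hypothetical ratings from two listeners (l1, l2) on an assumed 1-5 scale.
ratings = [
    ("sys_A", "naturalness", "l1", 4), ("sys_A", "naturalness", "l2", 3),
    ("sys_B", "naturalness", "l1", 2), ("sys_B", "naturalness", "l2", 3),
    ("sys_A", "intelligibility", "l1", 5), ("sys_A", "intelligibility", "l2", 4),
]
scores = mean_scores(ratings)
for (system, criterion), value in sorted(scores.items()):
    print(system, criterion, value)
```

In practice a campaign would also report the number of listeners and some measure of score dispersion, since a difference between two means is only meaningful relative to the variability of the judgments.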


ECESS (European Centre of Excellence in Speech Synthesis) 

Festvox is CMU’s TTS project. It organizes the Blizzard Challenge 

MBROLA Project

MUSSLAP (Multimodal Human Speech and Sign Language Processing for Human-Machine Communication) 

HUMAINE (Human-Machine Interaction Network on Emotion) 

TC-STAR (Technology and Corpora for Speech to Speech Translation) included a text-to-speech task 

EvaSy (in French "Evaluation des systèmes de Synthèse de parole": Speech Synthesis System Evaluation): Evaluation of speech synthesis in French.


SSW-7: 7th ISCA Speech Synthesis Workshop

Blizzard Challenge: 2010, 2009, 2008, 2007, 2006, 2005.

ISCA Speech Synthesis Workshops (SSW): SSW-7, SSW-6, SSW-5, SSW-4, SSW-3, SSW-2, SSW-1.


MBROLA: a toolkit to build TTS systems in many different languages.

Festival: the University of Edinburgh’s Festival Speech Synthesis System is a free software multi-lingual speech synthesis workbench.

Festvox tools: Festvox documentation and scripts.

Praat: speech analysis, synthesis, and manipulation package which can perform general numerical and statistical analysis.


TC-STAR-TTS Evaluation Package: distributed via the ELDA/ELRA catalogue.

Evasy Evaluation Package: distributed via the ELDA/ELRA catalogue.




  • H. Höge, Z. Kacic, B. Kotnik, M. Rojc, N. Moreau, H.-U. Hain, "Evaluation of Modules and Tools for Speech Synthesis - The ECESS Framework - ", LREC 2008, Marrakech, Morocco, 2008.
  • Luengo, I., Saratxaga, I., Navas, E., Hernáez, I., Sanchez, J., Sainz, I., "Evaluation of Pitch Detection Algorithms under Real Conditions", ICASSP 2007, Honolulu, Hawaii, USA, 2007.
  • A. Bonafonte, H. Höge, I. Kiss, A. Moreno, U. Ziegenhain, H. van den Heuvel, H.-U. Hain, X. S. Wang, M. N. Garcia, "TC-STAR: Specifications of Language Resources and Evaluation for Speech Synthesis", LREC 2006, Genoa, Italy, 2006.
  • Mostefa Djamel, Garcia Marie-Neige, Hamon Olivier, Moreau Nicolas (2006). "Evaluation report, Technology and Corpora for Speech to Speech Translation (TC-STAR) project". Deliverable D16, June 2006.


Multimodal technologies refer to all technologies combining features extracted from different modalities (text, audio, image, etc.).

This covers a wide range of component technologies:

  • Audiovisual Speech Recognition.
  • Audiovisual Person Identification.
  • Audiovisual Event Detection.
  • Audiovisual Object or Person Tracking.
  • Biometric Identification (using face, voice, fingerprints, iris, etc.).
  • Head Pose Estimation.
  • Gesture Recognition.
  • Multimodal Information Retrieval (e.g. Video Retrieval).



There is no generic evaluation approach for such a wide and heterogeneous range of technologies. In some cases, the evaluation paradigm is basically the same as for the equivalent mono-modal technology (e.g. traditional IR vs. multimodal IR). For very specific applications (e.g. 3D person tracking in a particular environment), ad hoc evaluation methodologies have to be pre-defined before the start of the evaluation campaign.

A good example of this is the multimodal evaluation framework set up for the CHIL project (Computers in the Human Interaction Loop). Different test collections (production of ground truth annotations) and specific evaluation metrics were defined to address a large range of audio-visual technologies:

  • Acoustic speaker identification & segmentation
  • Acoustic emotion recognition
  • Acoustic event detection
  • Speech activity detection
  • Face and Head tracking
  • Visual Person tracking
  • Visual Speaker Identification
  • Head Pose Estimation
  • Gesture Recognition
  • Multimodal Person Identification
  • Multimodal Person Tracking

For a complete overview of these tasks, see the book that was published at the end of the project.


Related Projects

  • AMIDA: an EU FP7 project, follow-up of the FP6 AMI project
  • QUAERO: Germano-French collaborative research and development program, focused on developing multimedia and multilingual indexing and management tools
  • TRECVid: Digital Video Retrieval evaluations at NIST.
  • ImageCLEF Campaigns: Cross-language image retrieval track within the Cross Language Evaluation Forum (CLEF).
  • CHIL Project: Computers in the Human Interaction Loop (IST-2002-506909).
  • AMI Project: Augmented Multi-party Interaction.
  • VACE (Video Analysis and Content Extraction): a US program including evaluations of object detection and video tracking technologies.
  • SIMILAR: European Network of Excellence on human machine interfaces.
  • HUMAINE: Human-Machine Interaction Network on Emotion (IST-2002-507422).
  • TECHNO-VISION, a French program that included several vision-related evaluation campaigns:
    • ARGOS: evaluation campaign for surveillance tools of video content
    • EPEIRES: Performance Evaluation of Symbol Recognition Methods
    • ETISEO: videosurveillance
    • EVALECHOCARD: medical imaging
    • IMAGEVAL : image processing technology assessment
    • IV2: Biometric iris and face identification
    • MESSIDOR: Methods to evaluate segmentation and indexing techniques in the field of retinal ophthalmology
    • RIMES : evaluation campaign for handwritten document processing
    • ROBIN : evaluation of object recognition algorithms
    • TOPVISION: submarine imaging systems.
  • BioSec Project on Biometrics and Security (IST-2002-001766)



  • ImageCLEF 2010: 2010 cross-language image retrieval evaluation campaign.
  • MIR 2010 (ACM SIGMM International Conference on Multimedia Information Retrieval)
  • CBMI’2010 (8th International Workshop on Content-Based Multimedia Indexing)
  • CIVR 2010 (ACM International Conference on Image and Video Retrieval)
  • CLEAR (Classification of Events, Activities and Relationships) evaluations :
    • CLEAR’07 included the following tasks: Person Tracking (2D and 3D, audio-only, video-only, multimodal), Face Tracking, Vehicle Tracking, Person Identification (audio-only, video-only, multimodal), Head Pose Estimation (2D, 3D), Acoustic Event Detection and Classification
    • CLEAR’06 included the following tasks: Person Tracking (2D and 3D, audio-only, video-only, multimodal), Face Tracking, Head Pose Estimation (2D, 3D), Person Identification (audio-only, video-only, multimodal), Acoustic Event Detection and Classification.
  • Past ImageCLEF campaigns: ImageCLEF 2009, ImageCLEF 2008, ImageCLEF 2007, ImageCLEF 2006, ImageCLEF 2005, ImageCLEF 2004, ImageCLEF 2003
  • VideoRec’08: International Workshop on Video Processing and Recognition.
  • VideoRec’07: First International Workshop on Video Processing and Recognition.
  • VP4S-06: First International Workshop on Video Processing for Security.
  • PETS: Performance Evaluation of Tracking and Surveillance: PETS’2006 (Surveillance of public spaces, detection of left luggage events), PETS’2005 (Challenging detection/tracking scenes on water.), PETS’2004 (people tracking), PETS’2003 (Outdoor people tracking - football data), PETS’2002 (Indoor people tracking (and counting) and hand posture classification), PETS’2001 (Outdoor people and vehicle tracking), PETS’2000 (Outdoor people and vehicle tracking)



  • TRECVID test collections are available from the LDC catalogue.
  • IAPR TC-12 is a free test collection for image retrieval containing still natural images with text captions in up to three different languages (English, German and Spanish).



  • Computers in the Human Interaction Loop, Alexander Waibel and Rainer Stiefelhagen (Ed.), Springer London, 2009.
  • Moreau N., Mostefa D., Stiefelhagen R., Burger S. and Choukri K. (2008). "Data Collection for the CHIL CLEAR 2007 Evaluation Campaign". In Proceedings of the 6th International Conference on Language Resources and Evaluation (LREC08), May 2008, Marrakech, Morocco.
  • Mostefa D., Moreau N., Choukri K., Potamianos G., Chu S., Tyagi A., Casas J., Turmo J., Cristoforetti L., Tobia F., Pnevmatikakis A., Mylonakis V., Talantzis F., Burger S., Stiefelhagen R., Bernardin K. and Rochet C. (2007). "The CHIL audiovisual corpus for lecture and meeting analysis inside smart rooms". In Language Resources and Evaluation, Vol. 41, No. 3, 16 December 2007, pp. 389-407.
  • Stiefelhagen R., Bernardin K., Bowers R., Garofolo J., Mostefa D. and Soundararajan P. (2007). "The CLEAR 2006 Evaluation". In Multimodal Technologies for Perception of Humans, Lecture Notes in Computer Science, Volume 4122/2007, pp. 1-44, 2007.

Information Retrieval (IR) deals with the representation, storage, organization of, and access to information items. The representation and organisation of the information items should provide the user with easy access to the information in which he is interested.

IR systems allow a user to retrieve the relevant documents which (partially) match his information need (expressed as a query) from a data collection. The system yields a list of documents, ranked according to their estimated relevance to the user’s query. It is the user’s task to look for the information within the relevant documents themselves once they are retrieved.
An IR system is generally optimized to perform in a specific domain: newswire, medical reports, patents, law texts, etc.

In recent years, the impressive growth of available multimedia data (audio, video, photos…) has required the development of new Multimodal IR strategies in order to deal with:
- annotated image collections (images with captions, etc.).
- multimedia documents combining text and pictures.
- speech transcriptions (e.g. transcribed TV programs), etc.
Multimedia and audio-visual data are processed by combining information extracted from different modalities: text, audio transcriptions, images, video key-frames, etc.

Moreover, in a globalized world, IR systems increasingly have to cope with multilingual information sources. In a multilingual context, we speak of Cross-Language Information Retrieval (CLIR) (see Moreau, Nicolas et al. Best Practices in Language Resources for Multilingual Information Access, Public report of the TrebleCLEF project (Deliverable 5.2), March 2009). The language of the query (source language) is not necessarily the same as the language(s) used in the documents (target language(s)).

Question Answering (QA) is another, particular approach to IR. In a QA system information needs are expressed as natural language statements or questions. In contrast to classical IR where complete documents are considered relevant to the information need, QA systems return concise answers. Often, automated reasoning is needed to identify potential correct answers.
The explosive demand for better information access among a large public of users fosters R&D on QA systems. The value of QA is that it gives inexperienced users flexible access to information: they can write a question in natural language and directly obtain a concise answer.

Other types of applications are often considered to be part of the IR domain: Information Extraction, Document Filtering, etc.


Most IR evaluation campaigns carried out until now rely on a comparative approach. Unlike objective evaluation (How well does a method work?), comparative evaluation focuses on the comparison of the results obtained with different systems (Which method works best?). To be compared, IR systems must be tested under similar conditions.

An IR comparative evaluation usually relies on a test collection consisting of:

  • a set of documents to be searched,
  • a set of test queries,
  • if available: the set of relevant documents for each query.

Once a test collection has been created, the general evaluation methodology is done in 3 main steps:

  1. Evaluation run: each IR system under evaluation searches the test collection using the pre-defined test queries. It yields a ranked list of documents for each test query.
  2. Relevance judgments: human evaluators examine each retrieved document and decide whether it is relevant, i.e. whether it satisfies the information need expressed by the query. If the set of relevant documents for each query is known a priori, this can be done automatically by comparing the set of retrieved documents with the reference set of relevant documents.
  3. Scoring: performance measures are computed based on the relevance judgments.

As long as they are tested on the same test collection (same set of documents and queries) the performance of different systems can be compared based on their final performance measures.

The human relevance judgment step represents the most time- and resource-consuming part of an IR evaluation procedure:

  1. It requires the hiring of a team of objective experts who have to behave as if they were real users, and judge the relevance of each retrieved document with regard to the queries.
  2. A human evaluation framework (computer interface, evaluation guidelines, training sessions) must be carefully designed to ensure that all evaluators work under the same conditions.



As early as 1966, Cleverdon (see Cleverdon, Cyril; Keen, Michael. Factors Affecting the Performance of Indexing Systems, Vol 2. ASLIB, Cranfield Research Project. Bedford, UK: C. Cleverdon, 1966, 37-59) listed six measurable features that reflect users’ ability to use an IR system:

  1. Coverage of information;
  2. Form of output presentation;
  3. Time efficiency;
  4. Effort required for the user;
  5. Precision;
  6. Recall.

In general, the objective evaluation of IR performance relies on the last two effectiveness measures (Precision and Recall), based on the number of relevant documents retrieved.

Considering the ranked list of retrieved documents for a given query, these two values are computed by considering the first N retrieved documents only (let’s call it the N-list):

  • Precision is defined as the number of relevant documents retrieved in the N-list divided by N (i.e. the proportion of relevant items in the N-list of retrieved documents).
  • Recall is defined as the number of relevant documents retrieved in the N-list divided by the total number of existing relevant documents in the collection (i.e. the proportion of retrieved items in the set of all relevant documents).

Precision and Recall measures are computed for different values of N resulting in a Precision/Recall curve that reflects the IR effectiveness.
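
The two definitions above can be sketched in a few lines. The function name, document identifiers and relevance set below are hypothetical; the sketch simply applies the definitions to the first N documents of a ranked list and repeats the computation for increasing N to obtain the points of a Precision/Recall curve.

```python
def precision_recall_at_n(ranked, relevant, n):
    """Precision and Recall computed on the first n retrieved documents.

    ranked:   list of document ids, best first (the system output).
    relevant: set of document ids judged relevant for the query.
    """
    n_list = ranked[:n]
    hits = sum(1 for doc in n_list if doc in relevant)
    precision = hits / n             # relevant items among the N-list
    recall = hits / len(relevant)    # retrieved items among all relevant ones
    return precision, recall


# Toy example: 5 retrieved documents, 3 relevant documents in the collection.
ranked = ["d3", "d1", "d7", "d2", "d9"]
relevant = {"d1", "d2", "d5"}

# One (precision, recall) point per value of N.
curve = [precision_recall_at_n(ranked, relevant, n)
         for n in range(1, len(ranked) + 1)]
for n, (p, r) in enumerate(curve, start=1):
    print(f"N={n}: precision={p:.2f} recall={r:.2f}")
```

Note that recall can never reach 1 here, because one relevant document ("d5") is not retrieved at all.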

Usually, a single-value metric is derived from the Precision/Recall plots and used as a final indicator of retrieval effectiveness. Usual metrics are:

  • Mean Average Precision (MAP);
  • Expected search length (rank of first relevant);
  • E-measure;
  • F-measure.

These metrics can be computed for a single query, but they are generally averaged over the whole set of test queries.
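
As a hedged illustration of two of these single-value metrics, the sketch below uses the common textbook definitions: Average Precision as the sum of the precision values at the ranks where relevant documents appear, divided by the total number of relevant documents, and the F-measure as the harmonic mean of Precision and Recall. Function names and data are hypothetical, and actual scoring tools may differ in details such as cut-offs or tie handling.

```python
def average_precision(ranked, relevant):
    """Average of the precision values at each rank holding a relevant
    document, normalized by the total number of relevant documents."""
    hits = 0
    precision_sum = 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant) if relevant else 0.0


def mean_average_precision(runs):
    """runs: list of (ranked_list, relevant_set) pairs, one per test query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


def f_measure(precision, recall):
    """Harmonic mean of precision and recall (balanced F-measure)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Two toy queries with invented document ids.
runs = [
    (["d3", "d1", "d7", "d2", "d9"], {"d1", "d2", "d5"}),
    (["d4", "d6"], {"d4"}),
]
map_score = mean_average_precision(runs)
f_score = f_measure(0.5, 2 / 3)
print(f"MAP = {map_score:.3f}, F = {f_score:.3f}")
```

Averaging per-query values, as done in `mean_average_precision`, gives every query the same weight regardless of how many relevant documents it has.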

Detailed descriptions of classical IR evaluation measures can be found in Ricardo Baeza-Yates, Berthier Ribeiro-Neto. Modern Information Retrieval. Addison-Wesley, 1999.

Other specific IR tasks may require other performance measures. For example, the performance of QA systems is measured by the percentage of correct answers obtained from a set of test questions.


  • TREC (Text Retrieval Conference): TREC-1, TREC-2, TREC-3, TREC-4, TREC-5, TREC-6, TREC-7, TREC-8, TREC-9, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011
  • CLEF (Cross-Language Evaluation Forum), cross-language IR and QA, European languages: 2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011
  • NTCIR (NII Test Collection for IR Systems), cross-language IR and QA, Asian languages: NTCIR-1, NTCIR-2, NTCIR-3, NTCIR-4, NTCIR-5, NTCIR-6, NTCIR-7, NTCIR-8, NTCIR-9
  • Quaero Collaborative R&D program on Multimedia Information Retrieval Evaluation (2008-2013).
  • TrebleCLEF 2-year EC project supporting CLEF activities (2008-2010).
  • FIRE (Forum for Information Retrieval Evaluation): IR initiative for Indian languages
  • CLIA (Cross Language Information Access): Consortium project focusing on IR, Summarization, and machine translation for Indian languages
  • INEX (Initiative for the Evaluation of XML Retrieval).
  • QALL-ME (Question Answering Learning technologies in a multiLingual and Multimodal Environment)
  • AQUAINT (Advanced Question Answering for Intelligence)
  • TRECVID Automatic Segmentation, Indexing, and Content-based Retrieval of Digital Video, originated by TREC.
  • CHORUS FP6 Coordination Action on Multimedia Content Search Engines.
  • ImagEVAL 2006 Evaluation of Content-Based Image Retrieval (CBIR) .
  • TIPSTER Document Detection, Information Extraction and Summarization.
  • much.more Cross-lingual information access for the medical domain.
  • AMARYLLIS IR Evaluation for French.
  • EQueR (Evaluation campaign for Question-Answering systems): evaluation of QA in French.
  • TIDES (Translingual Information Detection Extraction and Summarization). TIDES included several evaluation projects:
    • Information Retrieval: HARD (High Accuracy Retrieval from Documents).
    • Information Detection: TDT (Topic Detection and Tracking).
    • Information Extraction: ACE (Automatic Content Extraction).
    • Summarization: DUC (Document Understanding Conference).


  • CLEF 2010 (CLEF Conference).
  • SIGIR 2010 (Conference of the ACM’s Special Interest Group on Information Retrieval).
  • ECDL 2010 (European Conference on Research and Advanced Technology for Digital Libraries).
  • NTCIR-8 (the 8th NTCIR Workshop).
  • ECIR 2010 (European Conference on Information Retrieval).
  • FIRE 2010 (Conference of the Forum for Information Retrieval Evaluation).
  • MIR 2010 (ACM SIGMM International Conference on Multimedia Information Retrieval).
  • CBMI’2010 (8th International Workshop on Content-Based Multimedia Indexing).
  • CIVR 2010 (ACM International Conference on Image and Video Retrieval).
  • AIRS 2009 (Asia Information Retrieval Symposium)
  • CLEF 2009 (10th CLEF evaluation Campaign).
  • ECDL 2009 (European Conference on Research and Advanced Technology for Digital Libraries)
  • SIGIR’09 Conference (ACM’s Special Interest Group on Information Retrieval).
  • CIVR 2009 (ACM International Conference on Image and Video Retrieval).
  • JCDL 2009 (Joint Conference on Digital Libraries)
  • ECIR 2009 (European Conference on Information Retrieval)
  • IRF Symposium
  • NTCIR-7 (the 7th NTCIR Workshop)
  • FIRE 2008 (Conference of the Forum for Information Retrieval Evaluation)
  • TRECVID 2008 (TREC Video Retrieval Evaluation Workshop).
  • MIR 2008 (ACM International Conference on Multimedia Information Retrieval).
  • RIAO 2007 (Conference on Large-Scale Semantic Access to Content: Text, Image, Video and Sound)


  • trec_eval is the most commonly used scoring tool for IR evaluations.
  • QASTLE is a tool created in Perl by ELDA to perform human evaluation of Question-Answering systems.

Language Resources

Test Collections

  • CLEF Evaluation Packages: CLEF test suites are distributed through the ELRA Catalogue of Language Resources.
  • TREC Collections: TREC mostly deals with information retrieval in English.
  • NTCIR Test Collections: News corpora in English (Taiwan News, China Times English News, Hong Kong Standard, etc.) and other evaluation corpora: collections of patent application documents, web crawls…
  • Amaryllis Test Collections: news articles in French; plus titles and summaries of scientific articles. Distributed through the ELRA Catalogue of Language Resources.
  • EQueR Test Collections: news articles in French and domain specific medical corpus (scientific articles and guidelines for good medical practice). Distributed through the ELRA Catalogue of Language Resources.

Domain Specific Corpora

  • OHSUMED collection in medicine.
  • Cranfield collection in aeronautics.
  • CACM collection (Communications of the Association for Computing Machinery ACM) in computer science;
  • ISI collection (Institute of Scientific Information) in library science also referred to as CISI.
  • Chinese Web test collection, composed of documents, queries and relevance judgments.


Moreau, N. et al., “Best Practices in Language Resources for Multilingual Information Access”, Public report of the TrebleCLEF project (Deliverable 5.2), March 2009.

Cleverdon, C., Keen, M., “Factors Affecting the Performance of Indexing Systems”, Vol 2. ASLIB, Cranfield Research Project. Bedford, UK: C. Cleverdon, 1966, 37-59.

Baeza-Yates R., Ribeiro-Neto B., “Modern Information Retrieval”, Ed. Addison-Wesley, 1999.

Machine Translation (MT) technologies convert text from a source language (L1) into a target language (L2). One of the most difficult aspects of Machine Translation is the evaluation of a proposed system. Natural language is inherently ambiguous, which makes objective evaluation hard; in particular, there is rarely only one good translation for a given source text. Van Slype (1979) distinguished macro evaluation (also called total evaluation), which measures product quality and enables comparison of the performance of two translation systems or two versions of the same system, from micro evaluation (also known as detailed evaluation), which seeks to assess the improvability of the translation system.


The performance of a translation system is usually measured by the quality of its translated texts. Since there is no single correct translation for a given text, the challenge of machine translation evaluation is to provide an objective and economical assessment. Given the difficulty of the task, most translation quality assessments in the history of MT evaluation have been based on human judgement. Human scores (manual evaluation) are assigned according to the adequacy, the fluency or the informativeness of the translated text. Automatic procedures, however, allow a quicker, repeatable, more objective and cheaper evaluation. Automatic MT evaluation consists of comparing the MT system output to one or more human reference translations; the fluency and adequacy of MT output can then be estimated by n-gram analysis.
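
To make the idea of n-gram analysis concrete, here is a minimal sketch of a clipped n-gram precision between a system output and several reference translations. This is only the building block used by n-gram co-occurrence metrics, not a full implementation of any of them (for instance, it has no length penalty and does not combine several n-gram orders); the function names and sentences are invented for illustration.

```python
from collections import Counter


def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_ngram_precision(hypothesis, references, n):
    """Share of hypothesis n-grams that also occur in a reference.

    Each hypothesis n-gram is credited at most as many times as it
    appears in the single reference where it is most frequent.
    """
    hyp_counts = Counter(ngrams(hypothesis, n))
    if not hyp_counts:
        return 0.0
    max_ref_counts = Counter()
    for ref in references:
        for gram, count in Counter(ngrams(ref, n)).items():
            max_ref_counts[gram] = max(max_ref_counts[gram], count)
    matched = sum(min(count, max_ref_counts[gram])
                  for gram, count in hyp_counts.items())
    return matched / sum(hyp_counts.values())


hyp = "the cat is on the mat".split()
refs = ["the cat sat on the mat".split(),
        "there is a cat on the mat".split()]
p1 = clipped_ngram_precision(hyp, refs, 1)  # unigrams: word choice
p2 = clipped_ngram_precision(hyp, refs, 2)  # bigrams: local word order
print(p1, p2)
```

Unigram matches mostly reflect whether the right words were chosen (closer to adequacy), while longer n-grams also reward correct local word order (closer to fluency).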


Some of the most common automatic evaluation metrics are:

| Metric | Description | Reference |
|---|---|---|
| BLEU | IBM BLEU (BiLingual Evaluation Understudy) is an n-gram co-occurrence scoring procedure. | (Papineni et al., 2001) |
| NIST | A variation of BLEU used in NIST HLT evaluations. | (Doddington, 2002) |
| EvalTrans | Tool for the automatic and manual evaluation of translations. | (Niessen et al., 2000) |
| GTM | General Text Matcher, based on accuracy measures such as precision, recall and F-measure. | (Turian et al., 2003) |
| mWER | Multiple reference Word Error Rate: the edit distance between the MT system output and the closest of several human reference translations. | (Niessen et al., 2000) |
| mPER | Multiple reference Position independent word Error Rate. | (Tillmann et al., 1997) |
| METEOR | Metric for Evaluation of Translation with Explicit ORdering, based on the harmonic mean of unigram precision and recall. | (Banerjee & Lavie, 2005) |
| ROUGE | Recall-Oriented Understudy for Gisting Evaluation, based on an n-gram co-occurrence measure. | (Lin, 2004) |
| TER | Translation Error Rate. | (Snover et al., 2006) |
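
As a rough illustration of an error-rate metric with multiple references, the sketch below computes a word-level edit distance between the MT output and each reference and keeps the lowest resulting rate. Dividing by the length of the reference being compared is an assumption made here for simplicity; published definitions of mWER differ in how they normalize. The names and sentences are hypothetical.

```python
def word_edit_distance(ref, hyp):
    """Minimum number of word substitutions, insertions and deletions
    needed to turn the hypothesis into the reference."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution or match
        prev = curr
    return prev[-1]


def multi_reference_wer(hypothesis, references):
    """Word error rate (percent) against the closest reference.
    Normalizing by each reference's own length is an assumption."""
    return min(100.0 * word_edit_distance(ref, hypothesis) / len(ref)
               for ref in references)


hyp = "the cat is on the mat".split()
refs = ["the cat sat on the mat".split(),
        "there is a cat on the mat".split()]
mwer = multi_reference_wer(hyp, refs)
print(f"mWER = {mwer:.1f}%")
```

Using several references softens the penalty for legitimate variation: an output is only required to be close to one acceptable translation, not to all of them.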

For human evaluation, fluency and adequacy are two commonly used notions of translation quality (LDC2002, White et al. 1994). Fluency refers to the degree to which the system output is well-formed according to the target language’s grammar. Adequacy refers to the degree to which the output communicates the information present in the reference translation. Recently, other measures have been tested, such as the comprehensibility of an MT-translated segment (NIST MT09), or the preference between MT translations of different systems (NIST MT08).




Open-source Machine Translation Systems

  • GenPar Toolkit for Research on Generalized Parsing
  • Apertium open-source machine translation platform
  • JosHUa open-source decoder for parsing-based machine translation
  • Matxin open-source transfer machine translation engine
  • Moses open-source statistical machine translation system

Automatic Metrics

Language Resources @ ELRA


For further information on research, campaigns, conferences, software and data regarding statistical machine translation and its evaluation, please refer to the European Association for Machine Translation

The Machine Translation Archive also offers a repository and bibliography about machine translation.


  • Lin C.-Y., Cao G., Gao J., Nie J.-Y. (2006) An information-theoretic approach to automatic evaluation of summaries. In Proceedings of the main conference on Human Language Technology Conference of the North American Chapter of the Association of Computational Linguistics, p.463-470, New York, New York
  • Snover M., Dorr B., Schwartz R., Micciulla L., and Makhoul J. (2006) A study of translation edit rate with targeted human annotation. In Proceedings of the 7th Biennial Conference of the Association for Machine Translation in the Americas (AMTA-2006), Cambridge, Massachusetts.
  • Banerjee S. and Lavie A. (2005) METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, pages 65–72, Ann Arbor, Michigan. Association for Computational Linguistics.
  • Turian J. P., Shen L., and Melamed I. D. (2003) Evaluation of Machine Translation and Its Evaluation. Proceedings of MT Summit 2003: 386-393. New Orleans, Louisiana.
  • Doddington G. (2002) Automatic evaluation of machine translation quality using n-gram co-occurrence statistics. In Proceedings of ARPA Workshop on Human Language Technology.
  • Papineni K., Roukos S., Ward T. and Zhu W.-J. (2001) BLEU: a method for automatic evaluation of machine translation. Technical report, IBM Research Division, Thomas J. Watson Research Center.
  • Niessen S., Och F. J., Leusch G. and Ney H. (2000) An evaluation tool for machine translation: Fast evaluation for MT research. In Proceedings of the 2nd International Conference on Language Resources and Evaluation, Athens, Greece.
  • Tillmann C., Vogel S., Ney H., Zubiaga A., and Sawaf H. (1997) Accelerated DP based search for statistical translation. In Fifth European Conf. on Speech Communication and Technology, pages 2667–2670, Rhodos, Greece, September.
  • White J. S., O’Connell T. A. and O’Mara F. (1994) The ARPA MT evaluation methodologies: evolution, lessons, and future approaches. In Proceedings of the First Conference of the Association for Machine Translation in the Americas, Columbia, Maryland, USA.
  • Van Slype G. (1979) Critical study of methods for evaluating the quality of machine translation. Technical report, Final report BR 19142, Brussels: Bureau Marcel van Dijk.

Automatic Summarization aims to extract the most important content from an information source and present it to the user. Generally, two types of summaries are generated: extracts, i.e. summaries that contain text segments copied from the input, and abstracts, i.e. summaries consisting of text segments that are not present in the input.

One issue in summary evaluation is that it involves human judgments of different quality criteria such as coherence, readability and content. There is no single correct summary, and a system may output a good summary that is quite different from a human reference summary (the same problem arises in machine translation, speech synthesis, etc.).


Traditionally, summarization evaluation compares the tool output summaries with sentences previously extracted by human assessors or judges. The basic idea is that automatic evaluation should collerate to the human assessment.

Two main methods are used for evaluating text summarization. Intrinsic evaluation compares machine-generated summaries with human-generated summaries; it is considered a system-focused evaluation. Extrinsic evaluation measures the performance of summarization in various tasks; it is considered a task-specific evaluation.

Both methods require significant human resources, using key sentence (sentence fragment) mark-up and human generated summaries for source documents. Summarization evaluation measures provide a ranking score which can be used to compare different summaries of a document.


- Sentence precision/recall based evaluation
- Content similarity measures: ROUGE (Lin, 2004), cosine similarity, n-gram overlap, LSI (Latent Semantic Indexing), etc.
- Sentence Rank
- Utility measures
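
As a rough illustration of the content similarity measures listed above, the following sketch computes an n-gram overlap recall in the spirit of ROUGE-N and a bag-of-words cosine similarity between a candidate summary and a single reference. It is a simplified approximation and not the ROUGE package itself: whitespace tokenization, the single reference and the function names (`ngram_recall`, `cosine_similarity`) are assumptions made for this example.

```python
from collections import Counter
import math


def ngrams(tokens, n):
    """Return the multiset (Counter) of n-grams found in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def ngram_recall(candidate, reference, n=1):
    """Clipped n-gram overlap divided by the number of reference n-grams."""
    cand = ngrams(candidate.split(), n)
    ref = ngrams(reference.split(), n)
    total = sum(ref.values())
    if total == 0:
        return 0.0
    # Each reference n-gram is matched at most as often as it occurs in the candidate.
    overlap = sum(min(count, cand[gram]) for gram, count in ref.items())
    return overlap / total


def cosine_similarity(candidate, reference):
    """Cosine of the angle between the two word-count vectors."""
    a, b = Counter(candidate.split()), Counter(reference.split())
    dot = sum(a[word] * b[word] for word in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0
```

For example, `ngram_recall("the cat lay on the mat", "the cat sat on the mat")` counts five of the six reference unigrams as matched. Both scores can then be used to rank several candidate summaries of the same document.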

Related Projects

NTCIR (NII Test Collection for IR Systems) includes Text Summarization tasks, e.g. MuST (Multimodal Summarization for Trend Information) at NTCIR-7.

TAC (Text Analysis Conference): Recognizing Textual Entailment (RTE), Summarization, etc. 

TIPSTER : See the TIPSTER Text Summarization Evaluation: SUMMAC 
TIDES (Translingual Information Detection Extraction and Summarization).

TIDES included several evaluation projects:

  • Information Retrieval: HARD (High Accuracy Retrieval from Documents).
  • Information Detection: TDT (Topic Detection and Tracking).
  • Information Extraction: ACE (Automatic Content Extraction).
  • Summarization: DUC (Document Understanding Conference). DUC has moved to the Text Analysis Conference (TAC).

CHIL (Computers in the Human Interaction Loop) included a Text Summarization task.
GERAF (Guide pour l’Evaluation des Résumés Automatiques Français): Guide for the Evaluation of Automatic Summarization in French.


ACL-IJCNLP 2009 Workshop: Language Generation and Summarisation 
TAC 2009 Workshop (Text Analysis Conference). 
Language Generation and Summarisation Workshop at ACL 2009 
RANLP 2009 
CLIAWS3 (3rd Workshop on Cross Lingual Information Access) 
Multi-source, Multilingual Information Extraction and Summarization Workshop at RANLP2007 
TSC-3 (Text Summarization Challenge) at NTCIR-4 
Text Summarization Branches Out Workshop at ACL 2004 
DUC 2003 (HLT-NAACL Text Summarization Workshop)



Lin C.-Y. (2004) ROUGE: a Package for Automatic Evaluation of Summaries. In Proceedings of the Workshop on Text Summarization Branches Out (WAS 2004), Barcelona, Spain, July 25 - 26.


The goal of Speech-to-Speech Translation (SST) is to enable real-time, interpersonal communication via natural spoken language for people who do not share a common language. It aims at translating a speech signal in a source language into another speech signal in a target language.

The evaluation of SST systems can be considered an extension of MT evaluation (namely Spoken Language Translation, SLT) that includes speech recognition and speech synthesis in the evaluation loop, as an end-to-end evaluation.


The SLT component usually operates on the output produced by the ASR component and provides input for the speech synthesis component. Speech translation evaluation can be single-component or end-to-end. The former uses the output of the respective component to evaluate quality, while the latter uses the final output of the whole system.

End-to-end evaluations examine a system in its whole configuration and functionality. Single component evaluations focus on the different speech translation modules: speech recognition, speech synthesis, and machine translation. Each component is then scored with its own metrics, although their interpretation may differ.


According to different evaluation criteria, several measures can be used for the end-to-end evaluation, which are typically grouped into two main categories: the first estimates the audio quality of the output, while the second estimates how well its meaning is preserved. Evaluating the audio quality is fairly straightforward since it uses metrics very similar to those of speech synthesis evaluation. Meaning preservation is more complex and can be assessed with either subjective or objective measures.

Subjective evaluation uses assessments by human judges (users and/or experts) to measure the loss of meaning between the input, in the source language, and the output, in the target language. Several approaches can be used, such as asking questions about the content or asking judges to rewrite what they heard. Generally, the SST system is compared (directly or not) to a reference, typically a human interpreter.

Objective evaluation produces the same kind of measurement, but without judges' assessments. One or several experts check the SST output against a reference, so that the results are not biased by human factors (such as fatigue, noise, etc.).

Related Projects

TC-STAR: Technology and Corpora for Speech to Speech Translation, 6th FP project (2004-2007).

LC-STAR: Lexica and Corpora for Speech-to-Speech Translation Components (2002-2005).

NESPOLE!: NEgotiating through SPOken Language in E-commerce, 5th FP project (2000-2002).

TONGUES: Rapid Development of Speech-to-Speech Translation System (2000-2002).

Verbmobil: German project on Mobile Speech-to-Speech Translation of Spontaneous Dialogs (1996-2000).



First TC-Star Evaluation Workshop on Speech-to-Speech Translation 
Second TC-Star Evaluation Workshop on Speech-to-Speech Translation 
Third TC-Star Evaluation Workshop on Speech-to-Speech Translation


TC-STAR 2007 Evaluation Package - End-to-End Spanish-to-English
TC-STAR 2006 Evaluation Package - End-to-End Spanish-to-English


For further information on research, campaigns, conferences, software and data regarding speech-to-speech translation and its evaluation, please refer to the Machine Translation Archive.


Speech Recognition, also known as automatic speech recognition (ASR) or speech-to-text (STT), is a process by which a program or a system transcribes an acoustic speech signal to text.

Systems generally perform two different types of recognition: single-word and continuous speech recognition. Continuous speech is more difficult to handle because of a variety of effects such as speech rate, coarticulation, etc. Today's state-of-the-art systems are able to transcribe unrestricted continuous speech from broadcast data with acceptable performance.


Evaluation of ASR systems is mainly performed by computing the Word Error Rate (WER), or the Character Error Rate (CER) for languages such as Chinese or Japanese.

WER is derived from the Levenshtein distance (or edit distance) and measures the distance between the hypothesis transcription produced by the ASR module and the reference transcription.
The WER is computed after the alignment between the hypothesis and the reference transcriptions has been computed by dynamic programming (the optimal alignment being the one which minimises the Levenshtein distance). Usually the costs for insertion, deletion and substitution are 3, 3 and 4 respectively.
After alignment between the hypothesis and the reference, WER counts the number of recognition errors.
Three kinds of errors are taken into account when computing the word error rate, i.e. substitution, deletion and insertion errors.

Substitution: a reference word is replaced by another word in the best alignment between the reference and the system hypothesis.

Deletion: a reference word is not present in the system hypothesis in the best alignment.

Insertion: Some extra words are present in the system hypothesis in the best alignment between the reference and the hypothesis.
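
The computation described above can be sketched as follows. The WER is taken as the number of substitutions, deletions and insertions in the minimum-cost alignment, divided by the number of reference words, i.e. WER = (S + D + I) / N. This is a minimal illustration and not the NIST scoring tool: whitespace tokenization and the function names (`align_counts`, `word_error_rate`) are assumptions, and the default costs are the 3, 3, 4 values quoted above.

```python
def align_counts(reference, hypothesis, ins_cost=3, del_cost=3, sub_cost=4):
    """Return (substitutions, deletions, insertions) of a minimum-cost alignment."""
    ref, hyp = reference.split(), hypothesis.split()
    # table[i][j] = (cost, subs, dels, ins) for aligning ref[:i] with hyp[:j]
    table = [[None] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    table[0][0] = (0, 0, 0, 0)
    for i in range(1, len(ref) + 1):
        table[i][0] = (i * del_cost, 0, i, 0)      # only deletions
    for j in range(1, len(hyp) + 1):
        table[0][j] = (j * ins_cost, 0, 0, j)      # only insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost, subs, dels, ins = table[i - 1][j - 1]
            if ref[i - 1] == hyp[j - 1]:
                diagonal = (cost, subs, dels, ins)              # correct word
            else:
                diagonal = (cost + sub_cost, subs + 1, dels, ins)
            cost, subs, dels, ins = table[i - 1][j]
            deletion = (cost + del_cost, subs, dels + 1, ins)
            cost, subs, dels, ins = table[i][j - 1]
            insertion = (cost + ins_cost, subs, dels, ins + 1)
            # Tuples compare on cost first; ties are broken on the error counts.
            table[i][j] = min(diagonal, deletion, insertion)
    return table[-1][-1][1:]


def word_error_rate(reference, hypothesis):
    """WER = (S + D + I) / N, with N the number of reference words."""
    n_ref = len(reference.split())
    if n_ref == 0:
        raise ValueError("the reference must contain at least one word")
    return sum(align_counts(reference, hypothesis)) / n_ref
```

With the reference "the cat sat on the mat" and the hypothesis "the cat sit on mat today", the best alignment contains one substitution, one deletion and one insertion, giving a WER of 3/6 = 50%.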

Although the word is the basic unit for assessing ASR systems, the same computation can be made using different granularities (phonemes, syllables, etc.).
WER can be greater than 100% if the number of errors exceeds the number of words in the reference.
Prior to scoring both hypothesis and reference have to be normalized. The normalisation consists of converting the transcription into a more standardised form. This step is language dependent and applies a number of rules for transforming each token into its normalised form. For instance numbers are spelled out, punctuation marks are removed, contractions are expanded, multiple orthographies are converted to a unique form, etc.
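
A toy version of such a normalisation step is sketched below. The rule tables (`CONTRACTIONS`, `NUMBERS`, `VARIANTS`) are hypothetical and deliberately tiny; real scoring campaigns rely on much larger, language-specific rule sets.

```python
# Hypothetical, deliberately small rule tables for English (illustration only).
CONTRACTIONS = {"it's": "it is", "don't": "do not", "i'm": "i am"}
NUMBERS = {"1": "one", "2": "two", "3": "three"}
VARIANTS = {"colour": "color", "normalise": "normalize"}


def normalize(text):
    """Lowercase, strip punctuation, expand contractions, spell out digits, unify spellings."""
    words = []
    for token in text.lower().split():
        token = token.strip(".,;:!?\"()")          # remove surrounding punctuation
        token = CONTRACTIONS.get(token, token)      # may expand into several words
        for word in token.split():
            word = NUMBERS.get(word, word)          # spell out known digits
            words.append(VARIANTS.get(word, word))  # map spelling variants to one form
    return " ".join(words)
```

Applying the same function to both the hypothesis and the reference before alignment avoids counting purely orthographic differences as recognition errors.
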
Although WER is the main metric for assessing ASR systems, its major drawback is that all word errors are equally penalized, regardless of the importance and meaning of the word, e.g. an empty (function) word has the same weight as a named entity.

The performance of ASR systems is also measured in terms of speed, by measuring the processing time and computing the real-time factor on a specific hardware configuration.
This is an important factor for some applications that may require a real-time processing speed or some devices that are limited in terms of memory or processor speed.
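
The real-time factor is commonly taken as the ratio of processing time to audio duration, so that a value below 1 means the system runs faster than real time. The sketch below assumes that definition; `recognize` stands for any hypothetical transcription function supplied by the caller.

```python
import time


def real_time_factor(processing_seconds, audio_seconds):
    """Processing time divided by the duration of the processed audio."""
    if audio_seconds <= 0:
        raise ValueError("audio duration must be positive")
    return processing_seconds / audio_seconds


def timed_rtf(recognize, audio, audio_seconds):
    """Run a (hypothetical) recognizer on the audio and return its real-time factor."""
    start = time.perf_counter()
    recognize(audio)
    elapsed = time.perf_counter() - start
    return real_time_factor(elapsed, audio_seconds)
```

A system that needs 30 seconds to transcribe one minute of speech has a real-time factor of 0.5 on the hardware used for the measurement.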


For ASR evaluation, the criterion is recognition accuracy. One commonly used measure is the word error rate (WER), or the related word accuracy; WER is also used in machine translation evaluation.

The method used in the current DARPA speech recognition evaluation involves comparing the system transcription of the input speech to the reference (i.e., a transcription by a human expert), using algorithms to score agreement at the word level. Higher-level metrics such as sentence error rate or concept error rate can be applied, depending on the application.
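
As a small illustration of a higher-level metric, the sentence error rate can be sketched as the fraction of sentences whose hypothesis differs from the reference in at least one word. This reading of the metric, the whitespace tokenization and the function name are assumptions made for the example.

```python
def sentence_error_rate(references, hypotheses):
    """Fraction of sentences whose hypothesis is not word-for-word identical to the reference."""
    if not references or len(references) != len(hypotheses):
        raise ValueError("need two non-empty lists of equal length")
    wrong = sum(ref.split() != hyp.split() for ref, hyp in zip(references, hypotheses))
    return wrong / len(references)
```

Unlike WER, this measure does not distinguish a sentence with one wrong word from a sentence that is entirely wrong.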

Communication style (e.g., speaker-independent, spontaneous speech, etc.), vocabulary size, language model and usage conditions are also important features which can affect the performance of a speech recognizer for a particular task.

Related Projects

NIST Rich Transcription evaluations:

TC-STAR evaluation campaigns for ASR (evaluation packages are available from ELRA’s catalogue)

  • TC-STAR 2007 ASR evaluation campaign: BN and Parliament speeches for Chinese, English and Spanish
  • TC-STAR 2006 ASR evaluation campaign: BN and Parliament speeches for Chinese, English and Spanish
  • TC-STAR 2005 ASR evaluation campaign: BN and Parliament speeches for Chinese, English and Spanish

The ESTER evaluation campaigns (evaluation packages are available from ELRA’s catalogue)

  • The ESTER 1 (2005-2007) evaluation campaign focused on French broadcast news. There were three evaluation tasks, STT, Speaker Diarization and Named Entity Recognition.
  • The ESTER 2 (2008-2009) evaluation campaign also focused on French broadcast news. The same evaluation tasks as in the previous campaign were organised, together with new experimental tasks such as sentence boundary detection.

EVALITA 2009: Connected digit recognition for Italian in clean and noisy environments

AURORA: distributed noisy speech recognition evaluation framework for English (evaluation packages are available from ELRA’s catalogue)

  • Aurora 2: a connected digit recognition task under various additive noises
  • Aurora 3: in-car connected digit recognition task
  • Aurora 4: continuous speech recognition

CENSREC (Corpus and Environment for Noisy Speech RECognition): Japanese noisy speech recognition evaluation framework

  • CENSREC-1 (2003): noisy speech recognition evaluation frameworks
  • CENSREC-1-C (2006): voice activity detection under noisy conditions
  • CENSREC-2 (2005): in-car connected digit recognition
  • CENSREC-3 (2005): in-car isolated word recognition
  • CENSREC-4 (2006): an evaluation framework for distant-talking speech under hands-free conditions, connected digits as in CENSREC-1

CORETEX: Improving Core Speech Recognition Technology 
EARS: Effective Affordable Reusable Speech-To-Text, DARPA’s research program 
NESPOLE: NEgotiating through SPOken Language in E-commerce.


ICASSP’10: International Conference on Acoustics, Speech, and Signal Processing March 14 – 19, 2010, Dallas, USA 
LREC’10: International conference on Language Resources and Evaluation, May 17 – 23, 2010, Malta. 
InterSpeech 2010: September 26 - 30, 2010, Makuhari, Japan.


Scoring evaluation tools such as SCLITE are available on NIST’s speech group website: NIST tools


AURORA Project Database 2.0 - Evaluation Package

AURORA Project database - Subset of SpeechDat-Car - Finnish database - Evaluation Package

AURORA Project database - Subset of SpeechDat-Car - Spanish database - Evaluation Package

AURORA Project database - Subset of SpeechDat-Car - German database - Evaluation Package

AURORA Project database - Subset of SpeechDat-Car - Danish database - Evaluation Package

AURORA Project database - Subset of SpeechDat-Car - Italian database - Evaluation Package

AURORA Project Database - Aurora 4a - Evaluation Package

AURORA Project Database - Aurora 4b - Evaluation Package


TC-STAR 2007 Evaluation Package - ASR English

TC-STAR 2007 Evaluation Package - ASR Spanish - CORTES

TC-STAR 2007 Evaluation Package - ASR Spanish - EPPS

TC-STAR 2007 Evaluation Package - ASR Mandarin Chinese

TC-STAR 2006 Evaluation Package - ASR English

TC-STAR 2006 Evaluation Package - ASR Spanish - EPPS

TC-STAR 2006 Evaluation Package - ASR Mandarin Chinese

TC-STAR 2006 Evaluation Package - ASR Spanish - CORTES

TC-STAR 2005 Evaluation Package - ASR English

TC-STAR 2005 Evaluation Package - ASR Spanish

TC-STAR 2005 Evaluation Package - ASR Mandarin Chinese

ESTER Evaluation Package

MEDIA Evaluation Package


Multilingual text alignment consists of identifying correspondences between different text units (e.g., words, sentences, paragraphs) in parallel texts.


The main approach to alignment evaluation is to compare a system-computed alignment with a manually produced reference alignment, usually called a gold standard. Different tasks have been defined in previous evaluation exercises such as Blinker, ARCADE, HLT-NAACL and ACL.


Alignment evaluations were generally performed by using traditional IR measures:

  • Precision
  • Recall
  • F-measure
  • AER (Och and Ney, 2000), Alignment Error Rate, derived from F-measure
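
The measures above can be sketched for word alignment as follows, using the formulation of Och and Ney (2000), in which the reference distinguishes sure links S from possible links P (with S included in P) and the system proposes a set of links A. Representing links as (source index, target index) pairs and the function name `alignment_scores` are assumptions made for this illustration.

```python
def alignment_scores(predicted, sure, possible=None):
    """Precision, recall, F-measure and AER for sets of (source, target) links."""
    predicted, sure = set(predicted), set(sure)
    possible = set(possible or ()) | sure          # every sure link is also possible
    if not predicted or not sure:
        raise ValueError("predicted and sure link sets must be non-empty")
    precision = len(predicted & possible) / len(predicted)
    recall = len(predicted & sure) / len(sure)
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    aer = 1 - (len(predicted & sure) + len(predicted & possible)) / (len(predicted) + len(sure))
    return {"precision": precision, "recall": recall, "f_measure": f_measure, "aer": aer}
```

When the reference contains only sure links, AER reduces to 1 minus the F-measure, which is why it is described as derived from it.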



  • ARCADE I (1996-1999) and ARCADE II, multilingual text alignment evaluation campaigns (2003-2006).
  • Blinker (1998-2001).






Language Resources @ ELRA


Och F. J. and Ney H. (2000) A Comparison of Alignment Models for Statistical Machine Translation. In Proceedings of the 18th International Conference on Computational Linguistics (COLING 2000), p. 1086-1090, Saarbrücken, Germany.

Parsing is the process of structuring a linear representation in accordance with a given grammar (Grune and Jacobs, 1990).


The basic idea of parsing evaluation consists in measuring the similarity between the parser-generated tree-structure (also called labelled bracketings) and the manually constructed tree-structure.
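
One common way to quantify this similarity is to compare the labelled brackets of both trees and report precision and recall, in the style of PARSEVAL-like bracket scoring. The text does not prescribe a specific measure, so this choice is an assumption, as are the tree encoding (nested tuples whose first element is the label and whose leaves are words) and the function names.

```python
from collections import Counter


def labelled_brackets(tree, start=0):
    """Return (brackets, end): a list of (label, start, end) spans and the next word index."""
    if isinstance(tree, str):                      # a leaf covers exactly one word
        return [], start + 1
    label, children = tree[0], tree[1:]
    brackets, position = [], start
    for child in children:
        child_brackets, position = labelled_brackets(child, position)
        brackets.extend(child_brackets)
    brackets.append((label, start, position))
    return brackets, position


def bracket_scores(parsed, gold):
    """Labelled bracket precision and recall of a parser tree against a reference tree."""
    parsed_brackets = Counter(labelled_brackets(parsed)[0])
    gold_brackets = Counter(labelled_brackets(gold)[0])
    matched = sum((parsed_brackets & gold_brackets).values())
    precision = matched / sum(parsed_brackets.values())
    recall = matched / sum(gold_brackets.values())
    return precision, recall
```

For instance, a parser that attaches a prepositional phrase to the wrong constituent still gets credit for every other bracket it shares with the manually constructed tree.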

Adequacy evaluation involves determining the fitness of a parsing system for a particular task. Efficiency evaluation compares the parse time of a given parser with that of a reference parser on a common test data set.


Carroll et al. (1998) distinguished between evaluation methods that are useful in guiding the development of a parsing system (intrinsic evaluation) and those that are appropriate for comparing different systems (comparative evaluation). They divided parser evaluation methods into non-corpus and corpus-based methods:

  • Intrinsic evaluation
    • Listing linguistic constructions covered (no corpus)
    • Grammatical coverage (unannotated corpus)
    • Average parse base (unannotated corpus)
    • Structural consistency (annotated corpus)
    • Best-first/Ranked consistency (annotated corpus)
  • Comparative evaluation
    • Entropy/Perplexity (unannotated corpus)
    • Part-Of-Speech assignment accuracy (annotated corpus)
    • Tree similarity (annotated corpus)
    • Grammar Evaluation Interest Group (GEIG) scheme (annotated corpus)
    • Dependency structure-based scheme (annotated corpus)



PASSAGE, French evaluation campaign for syntactic parsing (2007-2009).

The Parsing Task of EVALITA 2009

The Parsing Task of EVALITA 2007

EASY, Evaluation campaign for syntactic parsing organized by French Technolangue action EVALDA (2003-2006).

XTAG, wide-coverage grammar development project for English using a lexicalized Tree Adjoining Grammar (TAG) formalism (1998).

SPARKLE, Shallow Parsing and Knowledge extraction for Language Engineering, European project (1997-2000).

GRACE, Grammars and Resources for Analyzers of Corpora and their Evaluation, part of the French CCIIL program (1994-1997).


Workshop on Parsing with Categorial Grammars

11th International Conference on Parsing Technologies (IWPT’09)

TLT 7, The 7th International Workshop on Treebanks and Linguistic Theories (2009).

CoNLL Shared Task 2009: Syntactic and Semantic Dependencies in Multiple Languages (2009).

EVALITA 2009, Parsing task.

COLING 2008, workshop on "Cross-Framework and Cross-Domain Parser Evaluation".

LREC 2008, workshop on "Partial Parsing Between Chunking and Deep Parsing".

ACL 2008, workshop on "Parsing German".

IJCAI, workshop on "Shallow Parsing in South Asian Languages".

EVALITA 2007, Parsing task.

COLING ACL 2006, tutorial on "Dependency Parsing".

MSPIL-06, First National Symposium on Modeling and Shallow Parsing of Indian Languages.

LREC 2002, workshop on "Beyond PARSEVAL - Towards Improved Evaluation Measures for Parsing Systems".

COLING 2000 Workshop on "Efficiency in Large-scale Parsing Systems".

LREC 1998, workshop on "The Evaluation of Parsing Systems".










 EASy Evaluation Package



More about evaluation measures.


  • Grune D. and Jacobs C. (1990) Parsing Techniques: a Practical Guide. Published by Ellis Horwood, Chichester, England.
  • Carroll J., Briscoe T., Sanfilippo A. (1998) Parser Evaluation: a Survey and a New Proposal. In Proceedings of the First International Conference on Language Resources and Evaluation, p. 447-455.


Information Extraction (IE) is a technology which extracts pieces of information that are salient to the user's needs. The kinds of information that systems extract vary in detail and reliability: named entities, attributes, facts and events.


Due to the complexity of the IE task and the limited performance of tools, there are few comparative evaluations in IE.
One can consider the Message Understanding Conference (MUC) as the starting point where most of the IE evaluation methodology was defined.
The performance of a system is measured by scoring filled templates with the classical information retrieval (IR) evaluation metrics: precision, recall and the F-measure. Another evaluation metric, based on the classification error rate, is also used for IE evaluation. Annotated data are required for training and testing.


Given a system response and a human-generated answer key, the system's precision is defined as the number of slots it filled correctly, divided by the number of slots it attempted. Recall is defined as the number of slots it filled correctly, divided by the number of possible correct fills taken from the human-generated key. One general issue is how to define filler boundaries, which relates to the question of how to assess an extracted fragment. Three criteria for matching reference occurrences against extracted ones are proposed (Freitag 1998):

  • The system output matches exactly a reference
  • The system output strictly contains a reference and at most k neighbouring tokens
  • The system output overlaps a reference
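
The three matching criteria, and the precision and recall they lead to, can be sketched as follows. Fragments are represented as (start, end) token offsets with an exclusive end, and "contains" is read as covering the whole reference plus at most k extra tokens; both choices, like the function names, are assumptions made for this illustration.

```python
def exact_match(output, reference):
    """The extracted fragment has exactly the reference boundaries."""
    return output == reference


def contains_match(output, reference, k):
    """The fragment covers the whole reference plus at most k neighbouring tokens."""
    (out_start, out_end), (ref_start, ref_end) = output, reference
    covers = out_start <= ref_start and ref_end <= out_end
    extra_tokens = (ref_start - out_start) + (out_end - ref_end)
    return covers and extra_tokens <= k


def overlap_match(output, reference):
    """The fragment shares at least one token with the reference."""
    return output[0] < reference[1] and reference[0] < output[1]


def precision_recall(outputs, references, matches):
    """Precision and recall of extracted fragments under a given matching criterion."""
    correct = sum(any(matches(o, r) for r in references) for o in outputs)
    found = sum(any(matches(o, r) for o in outputs) for r in references)
    precision = correct / len(outputs) if outputs else 0.0
    recall = found / len(references) if references else 0.0
    return precision, recall
```

For example, `precision_recall(outputs, references, lambda o, r: contains_match(o, r, 2))` scores a system under the second criterion with k = 2. The more lenient the criterion, the higher the resulting scores for the same system output.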

In Automatic Content Extraction (ACE) and MUC evaluation conferences, the criteria used for assessing each system output item are: correct, partial, incorrect, spurious, missing and non committal.


  • EVALITA 2009: two tasks related to Named Entity Recognition, and Local Entity Detection and Recognition
  • BOEMIE (2006-2008)
  • ACE (1999-2008)
  • MUC (1987-1998)




JULIE Labs NLP Toolsuite

Language Resources
Domain-independent annotated corpora:

  • MUC corpora (newswire articles, also available in Spanish, Chinese and Japanese for multilingual entity task)
  • ACE corpora (broadcast news, newswire, translated documents from Chinese and Arabic Treebank)

Domain-specific annotated data:

  • Job postings from WWW (Califf and Mooney 1998)
  • seminar announcements (Freitag 1998)




Maynard D., Peters W., and Li Y. (2006). Metrics for evaluation of ontology-based information extraction. In WWW 2006 Workshop on "Evaluation of Ontologies for the Web" (EON), Edinburgh, Scotland.

Freitag D. (1998). Information Extraction From HTML: Application of a General Learning Approach. In the Proceedings of the 15th National Conference on Artificial Intelligence (AAAI-98).

Califf M. E. and Mooney R. J. (1998). Relational Learning of Pattern-Match Rules for Information Extraction. In Proceedings of AAAI Spring Symposium on Applying Machine Learning to Discourse Processing, Stanford, CA, March.