Euromap Language Technologies

Your path to market for language and speech technologies


Articles

Articles about Human Language Technologies commissioned by Euromap France
Link to other articles commissioned by Euromap team: http://www.hltcentral.org/page-960.shtml


Natural Language Querying (June 2003)

Language technology in France: aiming for immortality? (February 2003)

Natural language querying

Asking a question in free text and getting back a response as a natural expression is what every Web surfer wants. This is already possible on intranets and some e-commerce sites using natural language processing applications, and should soon spread to the Web as a whole.

The arrival of the Internet and the spread of the Web have given us a vast and fast-growing quantity of online information. In 2000, there were already a billion pages on the Web, and that figure will have increased a hundredfold by now. Since this information is designed for ordinary end users, with no special training in computing or library science, it is very important to be able to ask questions in natural language and receive an answer of the same, or at least an understandable, kind. This is especially true for e-commerce sites, which are the key targets for search engines. Yellow pages and e-directories played a key role in this area even before the Web, especially in France, where LexiQuest (since acquired by SPSS) developed such applications for the Minitel videotex network in the 1990s.

France plays a fairly central role in R&D in natural language processing, covering linguistics, lexicology, semantics, morphology and parsing, semantic webs/knowledge maps, and more. A number of firms are active here too, including askOnce (which came out of the Xerox Innovation Group's XRCE lab, based in Grenoble), Erli (which became LexiQuest, since acquired by SPSS), Kaidara, Lingway, Sinequa (formerly Cora), Technologies SA (T-GID) with Spirit (a natural-language query system) and Xylème.

"France developed a very proactive approach, especially among SMEs," says Christophe Binot, CIO at TotalFinaElf. Some research tools using natural language came out of the Web world (Autonomy, Kelkoo, for example), but almost all are derived from document management systems (Hummingbird, Xerox, etc.) or for knowledge management applications such as Kaidara and SER, since the research issues are similar.

 

Linguistic analysis of documents

Web documents are mainly indexed automatically as they are put online. Search engines began with simple keywords, but more advanced systems integrate full-text search. Natural language querying therefore involves analysing the form of the query, as well as the target texts themselves, to find concordances between word strings.

Beyond word-for-word comparison, these methods draw on lexical, semantic, syntactic and statistical knowledge about language to fine-tune their searches. For example, a search can use dictionaries that exploit synonymy or multilingual knowledge, as the Yellow Pages do. To go even further, search engines can also correct, or propose corrections for, spelling errors, rather like a spell checker in a word-processing application.
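
As a rough illustration of these two steps, here is a minimal sketch in Python of spelling correction followed by synonym expansion. The vocabulary and synonym table are hypothetical toy data; a production engine would rely on large, hand-curated (and multilingual) lexicons.

```python
import difflib

# Toy vocabulary and synonym dictionary (hypothetical data); real systems
# use large, hand-curated multilingual lexicons.
VOCABULARY = ["dress", "price", "size", "cheap"]
SYNONYMS = {"cheap": ["inexpensive", "budget"], "dress": ["gown"]}

def correct(word):
    """Spell-correct a query word against the known vocabulary."""
    matches = difflib.get_close_matches(word.lower(), VOCABULARY, n=1, cutoff=0.8)
    return matches[0] if matches else word.lower()

def expand_query(query):
    """Correct each word, then add its synonyms to widen the search."""
    terms = set()
    for word in query.split():
        word = correct(word)
        terms.add(word)
        terms.update(SYNONYMS.get(word, []))
    return terms

print(expand_query("cheep dress"))
# {'cheap', 'inexpensive', 'budget', 'dress', 'gown'}
```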

In 2000, the French Foreign Trade Centre (CFCE) set up an intranet covering all exportable products and services using LexiQuest solutions. Successful queries of this system are due to its "fairly extensive multi-subject dictionary for each sector (industry, agrifoods, services), comprising 11,000 entries with 30,000 links between them," says Alain Rossi, IT manager at CFCE. "It is the quality of the dictionaries which determines the quality of the natural language responses."

To handle 800 letters and 500 telephone calls a week, the European Court of Human Rights has put all its documents online for consultation by legal experts and European citizens in general. This online knowledge management system, called HUDOC, uses language tools supplied by Hummingbird. Lingway, created by the former head of LexiQuest, Bernard Normier, is developing content and patent search tools for the pharmaceutical industry based on sophisticated linguistic dictionaries.

Autonomy identifies combinations of character strings that its search engine considers relevant through statistical processing of a vast collection of documents. The results of a query depend on how frequently the query's words and expressions appear inside documents. "It gives the user a greater range of documents, and avoids having an information professional regularly update the knowledge bases," says Frédéric Demongeot, technical manager at Autonomy.
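
A minimal sketch of this kind of frequency-based ranking (a generic term-frequency score, not Autonomy's proprietary algorithm) might look like this in Python:

```python
from collections import Counter

def rank_documents(query, documents):
    """Score each document by how often the query's words appear in it."""
    query_words = query.lower().split()
    scores = {}
    for name, text in documents.items():
        counts = Counter(text.lower().split())
        scores[name] = sum(counts[w] for w in query_words)
    # Highest-scoring documents first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

docs = {
    "a": "the valve remained open during the test of the valve",
    "b": "routine maintenance report with no valve issues",
}
print(rank_documents("valve open", docs))  # [('a', 3), ('b', 1)]
```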

 

Smart questions, smarter answers

Natural language querying has a natural offspring: the natural language response. The ultimate approach would be to reply to a search question not with a text or a set of documents, but with a fully formulated answer. Sinequa, a French company dedicated to language processing, originally created in 1983 as Cora, was the lead manager of the Eureka project Carolus (1993-1997), which aimed to develop an intelligent information search system with just such a response function. This project engendered a software system called Intuition, which was developed and commercialized by Sinequa.

A user asks a question such as "I'm looking for a size 38 dress at less than 100 euros." The system doesn't just search its database for words identical to those in the question; it corrects any spelling mistakes in the question, finds the product in the relevant database and checks that the right size is in stock. In another application, Diva-Press, a press agency, uses Sinequa's Intuition technology to enable users to find information in a corpus of business and financial news.
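
A toy sketch of this kind of constraint extraction, in Python, could look like the following. The patterns and product data are hypothetical, and a real system such as Intuition performs full linguistic analysis rather than simple pattern matching:

```python
import re

# Hypothetical product catalogue.
PRODUCTS = [
    {"item": "dress", "size": 38, "price": 89.0},
    {"item": "dress", "size": 38, "price": 120.0},
    {"item": "dress", "size": 40, "price": 75.0},
]

def parse_query(query):
    """Pull a size and a maximum price out of a free-text question."""
    constraints = {}
    if m := re.search(r"size (\d+)", query):
        constraints["size"] = int(m.group(1))
    if m := re.search(r"less than (\d+) euros", query):
        constraints["max_price"] = float(m.group(1))
    return constraints

def search(query):
    c = parse_query(query)
    return [p for p in PRODUCTS
            if p["size"] == c.get("size", p["size"])
            and p["price"] <= c.get("max_price", float("inf"))]

print(search("I'm looking for a size 38 dress at less than 100 euros"))
# [{'item': 'dress', 'size': 38, 'price': 89.0}]
```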

The information extracted is categorized on the fly by the search engine and made automatically accessible to each user profile. A similar application has been developed for Le Monde newspaper. "Information professionals use Sinequa technology and can then run semantic searches on the 800,000 or so articles in the archive", explains Philippe Laval, founder and managing director of Sinequa.

These search techniques enable a more subtle appreciation of the question's components, which in turn prevents irrelevant responses being returned. The major risk, however, is that you get nothing back at all. Another technique, fuzzy logic, added to semantic searching, allows the process to return at least an approximate response. The company Kaidara, for example, has developed a case-based query system which becomes more and more effective as it adds new successes to its store of cases.

Kaidara's Text2Data product can therefore handle concepts, not just words. It will understand, for example, that "the valve remained open" means the same as "the valve was not shut". It can also manipulate numerical data in a text, using approximation to give a more or less satisfying response rather than nothing at all. For example, if a user is looking for an object with specific features costing about 20,000 euros, the system will return information about an object costing 20,500 euros, knowing this figure is close to the one in the question.
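
A minimal sketch of that kind of numeric tolerance (a hypothetical linear scoring rule, not Kaidara's actual method) would score candidates by how close their price lies to the requested figure:

```python
def closeness(target, actual, tolerance=0.10):
    """Return 1.0 for an exact match, falling linearly to 0.0 at a
    relative distance of `tolerance` (here 10%) from the target."""
    distance = abs(actual - target) / target
    return max(0.0, 1.0 - distance / tolerance)

offers = {"A": 20_500.0, "B": 19_800.0, "C": 35_000.0}
target = 20_000.0

# Rank offers by closeness; C scores 0.0 and would be dropped.
for name, price in sorted(offers.items(), key=lambda o: -closeness(target, o[1])):
    print(name, price, round(closeness(target, price), 2))
# B 19800.0 0.9
# A 20500.0 0.75
# C 35000.0 0.0
```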

When searches are confined to restricted domains, it is easier to get worthwhile results. This is the case for an enterprise intranet, for example, where documents are carefully circumscribed, the number of users is limited and in theory their questions focus only on certain specific areas. Hence the success of applications set up in press agencies or eCommerce sites.

Semantic search factors in the meaning of the language and the context. It helps filter a term according to the subject matter, discipline or business sector in question. For example, Leroy Merlin has built a special dictionary on DIY for its site using Sinequa's Intuition engine. And Rhodia has chosen Kaidara's Advisor technology to enable intelligent dialogue between manufacturers of complex products and a technical document base.

Using LexiQuest technology, visitors to General Electric's corporate web site have been able since the end of 1999 to ask questions of the search engine in natural language, as well as by keywords. "We aim to help our professional and consumer customers to find what they want on our Web site as simply and quickly as possible," says Loretta Wilary, customer information manager at General Electric. The Kelkoo search engine provides supplier prices and conditions for products and services, using a Yellow Pages type approach.

However, when there is no restriction on the scope of documents, semantic processing is essential. "Finding information in its context and adapting to a professional context means factoring in semantics," says Sylvie Pichot, pre-sales consultant at Verity. This is why the CFCE is using LexiQuest technology. "The specific nature of the CFCE information space is that it covers any exportable product or service. So it requires a fairly deep multi-domain dictionary for each business sector, as well as general business terms," says Alain Rossi.

 

XML and the semantic web

Already deployed in enterprise intranets, XML (eXtensible Mark-up Language) is set to gradually replace HTML on the Web. It has already been adopted by several document management firms (Ixiasoft, Lingway, Sinequa and Xylème).

By separating the semantic structure of a document from its physical representation, XML can collect heterogeneous documents together in a common structure. "XML can unify and give a global view of a document," says Philippe Laval. INRIA and Xylème, for example, are experimenting with the automatic synthesis of relevant information in documents retrieved by a search engine. They are wagering on XML as a publication standard for information. "XML will offer more possibilities since you can add in semantic-type tags, and not just structural data as in HTML," emphasizes Sophie Cluet, research director at INRIA and founder of Xylème.
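
To illustrate the difference with a hypothetical document (any real schema would be project-specific): where HTML carries mainly structural and presentational tags, XML tags can carry meaning that a program can query directly, as this Python snippet shows:

```python
import xml.etree.ElementTree as ET

# In HTML these values would sit in anonymous layout cells; in XML the
# tags themselves say what the data means, so an engine can target them.
doc = ET.fromstring("""
<offer>
  <product>dress</product>
  <size>38</size>
  <price currency="EUR">89</price>
</offer>
""")

# Query by semantic tag rather than by position in a layout.
print(doc.findtext("product"))            # dress
print(float(doc.findtext("price")))       # 89.0
print(doc.find("price").get("currency"))  # EUR
```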

R&D efforts in natural language, advances in document management technologies and the emergence of related standards should help pave the way for everyone's dream: accessing, directly and simply, all the intelligence on the Web.

 

Enterprises referred to:

askOnce – www.askonce.com

Autonomy – www.autonomy.com

Hummingbird – www.hummingbird.com

Ixiasoft – www.ixiasoft.com

Kaidara – www.kaidara.com

Kelkoo – www.kelkoo.fr

Lingway – www.lingway.com

SER – www.ser.com

Sinequa – www.sinequa.com

SPSS – LexiQuest – www.spss.com/france

Technologies SA – Spirit – www.t-gid.com

Xylème – www.xyleme.com

Verity – www.verity.com

Claire Rémy is a freelance journalist specialising in computer science. She has written books on artificial intelligence and the philosophy of science, in particular:

  • L’intelligence artificielle, Dunod, Paris 1994
  • L’intelligence et son miroir. Voyage autour de l’intelligence artificielle, Iderive, Lausanne 1990
  • La frontière entre déterminisme et indéterminisme : une réponse systémique, Lausanne 1989


Language technology in France: aiming for immortality?


If all goes according to plan, French language speakers in 50,000 years' time will be able to find out what French (and other language) speakers in 2003 thought about the universe. A French-inspired project baptised KEO (www.keo.org) aims to send a winged satellite into space next year, containing a large digital corpus of individual messages and other texts in French and 59 other languages, constituting a new 'Library of Alexandria', to inform any of our descendants who might decode them about our 3rd-millennium preoccupations.


History suggests that in the last 5,500 years, roughly since the birth of written language in Sumer, and only a tenth of the duration of KEO's mission to the stars, the world has already lost thousands of tongues. It is continuing to lose them at a rate of hundreds a year (http://www.ogmios.org). But will there be a language technology engine on hand in the year 52003 to enable the planet's inhabitants to decipher what their French ancestors, back in the age of the information society (how quaint!), were trying to communicate?


Well, possibly not, despite the fact that companies such as Lingway (www.lingway.com) and Sinequa (www.sinequa.com) are two of the 'expert' companies enrolled in the KEO project to help with linguistic analysis of the messages to be stored in the spacecraft. They are both good examples of how French language technology (more specifically, text technology) has evolved, some might say survived, over the last decade or so. But there is no evidence yet that they, and others like them, will become quasi-immortal!


Two decades of enterprise building

While Sinequa, which provides advanced search engine technology, evolved from a company first started way back in 1983 during the very first wave of language technology development in France, Lingway is a new company founded in 2001, whose CEO, Bernard Normier, is nevertheless one of the historic figures of commercial language technology in France.


After launching ERLI in the 1980s to supply language technology to, among other projects, France Télécom's mass-market Minitel system, he morphed the company into LexiQuest during the 1990s and expanded it internationally, notably to Silicon Valley, though it has since been acquired by a non-French technology firm. In a sense, Bernard Normier is back where he started: creating a new company to develop text technology for integration into specific vertical industry IT configurations.


In addition to firms such as Sinequa and Lingway, there are a good dozen or so active language technology companies in France, plus another dozen speech technology suppliers: a comparatively healthy score when set against countries such as Germany and the United Kingdom. Most are small operations, though several large enterprises such as Xerox, IBM, France Télécom and Thales have long had major R&D centres in France, and have spun off language technology companies such as Kalima from IBM and Telisma from France Télécom. But for Bernard Normier, the apparently high number of small companies is illusory, since many of them are created, as his was, by groups or individuals breaking away from existing companies in a process of increasing fragmentation and financial fragility.


According to Etienne Lamort de Gall, Marketing Officer for Elan Speech (http://www.elantts.com/accueil.html), one of the more successful French speech technology companies, "new business creation in the speech field in France is stagnant at the moment, apart from one or two rare exceptions". This is partly due to the post-dotcom blues, the reining in of development budgets and the scarcity of venture capital, though even in the heady days of the boom, France never offered a particularly start-up-friendly environment. But it is also due, as Philippe Laval, CEO of Sinequa, suggests, to the fact that the French "are pretty good at developing technology, but not particularly good at selling it."


Language as culture not commerce

France as a world-class nation has certainly focused on the need for language technology, whether or not that need has been met. After at least 25 years of government and EC funding, there is clear consensus in the community that France has an international reputation for its language technology R&D. It was indeed France that first coined the term 'industries de la langue', in a prescient 1985 report claiming there was an urgent need, in the words of the late Maurice Gross, one of the country's eminent computational linguists, to 'take the language apart' if it were to meet the challenge of English in the global information and communication technologies revolution.


It is precisely this geopolitical commitment to reaffirm France's role in the world through its language (Francophony) that has driven policymaking, rather than a concern for citizens' requirements or economic competition. The result, as Stéphane Chaudiron of the ICT Division of the Ministry for Research suggests, is that the growing awareness of the 'decline' of French as a major idiom of scientific and technological communication has been a two-edged sword for developing a genuine language technology programme. Although the concept of Francophony prompted government funding for R&D, it also led to an over-emphasis on the political and cultural mission, instead of a more rigorous evaluation of the industrial and economic challenges involved in adapting a language to an information society context.


Whatever the ideological motivation, this policy has grown a very strong R&D base in language technology, supported, as Chaudiron points out, by a welcome 'stability' in the various ministerial departments that have systematically tracked and encouraged language technology over the years, thanks to the sustained commitment of people such as Jacques Matthieu at the Industry Ministry.


Lingway’s Bernard Normier recognises the ‘excellent level of university research centres’ as one of France’s strong points, and Elan’s Lamort de Gall agrees that France has a large population of researchers with a good training in linguistics, signal processing and language processing. But the fact remains that there is a perceived mismatch between the substantial effort at language technology R&D and the very slow transfer of resulting technologies and tools to France’s commercial and industrial users.


New R&D projects

Will the latest round of programme building change this imbalance between R&D excellence and commercial passivity? Following a year 2000 report, the government has decided to devote more language technology funding to three tracks: developing an HLT infrastructure by filling gaps in the language technology resources base (a project known as Technolangue, www.recherche.gouv.fr/appel/2002/technolangue.htm); boosting the use of French language technology applications in the public sphere; and providing better training for digital content librarians. Ultimately, this triad of measures aims to encourage organisations to take up language technologies to boost their competitiveness in business intelligence and similar knowledge processing activities in a global economy.


Technolangue, masterminded by Joseph Mariani, another historic figure in French and European language industry circles through his work in speech technology, and now a member of the Ministry of Research ICT team, has a relatively small budget of 4 million euros. Yet it aims to build a stronger infrastructure to feed other existing language technology development projects, which through public R&D networks are currently spending up to 100 million euros on speech and language technology projects across France.


Technolangue’s emphasis on creating and evaluating resources would appear to respond to a constant complaint among language technology firms – the lack of good, industrial-strength data for testing tools and extracting linguistic information for further development, among other needs. At the recent LangTech (www.lang-tech.org) event held in Berlin, for example, Francis Charpentier, Chief Technology Officer of the highly successful French speech technology supplier Telisma, argued that one of the key barriers to developing the speech recognition systems for next generation telecommunication services through a pervasive multimodal interface is the lack of good quality speech data in extremely large volumes – literally millions of spoken words in dozens of languages.


As it happens, France is home to the European agency for distributing vital language and speech resources. Founded in 1995, ELRA (European Language Resources Association) and its commercial arm ELDA (Evaluation and Language resources Distribution Agency) were assigned the critical mission of handling the evaluation and distribution of resources in Europe on an independent basis, for academic and industrial researchers alike. Naturally, ELRA is closely involved in the Technolangue initiative, even though, as a not-for-profit body, its ultimate constituency is not just European but global.

Although widely welcomed as an excellent initiative, the programme is likely to find it hard to satisfy everyone. Sinequa’s Philippe Laval compares the French situation with the USA, where “DARPA efforts in linguistic resources are 10 to 100 times greater than those in France” and where a resource base has been created that is provided more or less free of charge.


Free resources, however, do not necessarily lead to a stronger industrial infrastructure for language technology. Frédérique Segond, who works in business development at the Xerox Research Centre Europe (www.xrce.com) in Grenoble, one of the few dedicated industrial language technology R&D labs, explains that one of the real problems in developing resources is that "language technology SMEs must see some commercial benefit to the effort they contribute to developing resources for the community, otherwise in the long run they will be tempted not to join R&D programmes."


Rather than dishing out small subsidies to everyone, public or private, which is more or less the norm for French R&D programmes, she believes that strong agreements between public and private players will be key to ensuring a balanced return on investment in resource development: “the industrial player brings a certain real-world competence and focus to the resource process, and working with them will enable the academic teams to earn money as well. It is very hard to get companies to work for nothing!”


Softissimo's (www.softissimo.com) Theo Hoffenberg, who heads France's second largest automatic translation company after Systran, thoroughly agrees with the priority given to developing resources. But in his experience, producing high-quality resources suited to enterprise-type applications, rather than R&D prototypes, demands a high level of competence yet is not usually considered intellectually exciting enough to attract university researchers. Nevertheless, he would argue that between his own company and Systran, another locally-incorporated company, France already has an unrivalled resource base in automatic translation.


Where’s the money?

Which raises the 64 thousand euro question – when all the research has been done, when the resources are available, when the tools have been licensed to industry, and when the technologies have matured enough to be fully stable, how do people make money out of language technology in France? Where is the real market for language technologies? Is it national or inherently cross border? And is it to be found in the consumer segment or inevitably linked to niche applications in small corners of the great information technology fabric?


For Philippe Laval, "the market is essentially technology driven: we are still selling to innovators and not to the heart of the market." The reason for this, of course, is one that market analysts have voiced constantly over the years and that Bernard Normier echoes today: "apart from translation, people do not really know they need language technologies. What they do know they need are solutions to specific problems."


The arrival of the Internet as a corporate and consumer service in 1995 gradually highlighted the problem of multilinguality in France, after the easier years of French-only Minitel, and the consequent need for automatic translation solutions. But there has been no similar quantum leap in enterprise take-up of advanced search, summarisation or other content-driven language technologies.


In many cases, in fact, government administrations and government-controlled companies have been among the most visible adopters of the technologies, in a very French command-economy approach to stimulating technology development: the State funds the research, then, through ANVAR (http://www.anvar.fr/agenanglais.htm), the French innovation agency, supports technology transfer to SMEs whose products and services are eventually purchased… by government-funded organisations such as ministries, banks, railways and industries.

Telling users about their needs

A further issue about growing a market, as Normier suggests, is understanding that language technologies form only one of the building blocks of a more comprehensive system or solution, and persuading enterprises and consumer application developers to buy into the technology. In Laval’s view, natural language processing will eventually become a commodity, operating as a pervasive component in IT applications. But there is still no magic bullet for persuading customers to pay closer attention to what is on offer.


Frédérique Segond identifies a certain "lack of diversity" in available applications, arguing that the most appropriate test for mature language technologies will come from large multimedia-type projects in which the language intelligence represents only 30% of the whole. This means that if the marketplace is inevitably moving towards greater integration and convergence of language, speech and knowledge technologies in general, then what is missing is not the language technology but a robust platform for 'industrialising' it to meet this need for greater integration.


"Where France has fallen behind," says Elan's Etienne Lamort de Gall, "is in developing a strong industrial production chain that is robust and pervasive enough to enable companies to take a position on the market. Good R&D alone is completely insufficient for developing an economic and industrial fabric based on speech technologies, for example."


However, as in any supply-driven market, "it is up to us to create demand," says Etienne Lamort de Gall. "After many years of evangelising we are starting to feel the emergence of real demand for speech technologies. But French demand is far behind that found in the US, Germany and the UK."

Does size matter?

Which raises the issue of market size. Most of the smaller companies in the language technology sector do not generate sales of more than 3 million euros a year, and in many cases much less. Growing internationally, although essential, is usually not an available strategy, due to the high cost of developing technologies for other language markets. This does not appear to be true, however, for the speech recognition market, where multilinguality is easier to manage through licensing agreements for existing signal processing modules.


Although speech-to-text applications require the same effort as text to internationalise products, Elan Speech’s premier market today is Germany, where market readiness for the firm’s radically multilingual application is considerably higher than in France. “The size of the French market is totally insufficient to enable a technology company to finance its growth from revenues.”

Consolidate or…

So would consolidation within the supply sector help solve the problem of market size and corporate fragility in the industry? The general feeling is that there is a real risk that some large foreign company might take over the whole of the French language technology sector in one fell swoop. There is a precedent in the office software market: France commercialised several word processing packages in the mid-1980s, when the personal computer began to spread through the population, yet today there are no national software publishers left, even though the software sector employs as many people as the automobile industry, mainly thanks to France's strong showing in the field of system integration with such players as Cap Gemini Ernst & Young.


Philippe Laval believes it is essential to consolidate the sector to fight collectively against the might of global software behemoths. As does Etienne Lamort de Gall, who suggests that the consolidation needed to develop international business will probably come naturally through partnerships between French companies. Bernard Normier reckons that consolidation is inevitable but warns that the combination of a certain lack of standards and the highly technical nature of the market mean that expert management will be an essential ingredient in the process. The notorious example of Lernout & Hauspie, it would seem, is precisely what to avoid.


A betting man would not wager on French language technology as we know it surviving 50,000 years along with the KEO satellite. Meanwhile, companies like Lingway and Sinequa will be able to use this very French project, half poetic conceptualisation, half universal showcase, to collect quantities of text free of charge in a broad range of languages. And that, hopefully, will help them provide better knowledge services to enterprises just five years down the road.


Writer

Andrew Joscelyne is a Paris-based consultant and writer about language technology. You can reach him at ajoscelyne@bootstrap.fr

