Issue #5 | April 2023
- New ELRA Board
- Language Resources
- Legal Issues
- ELRA/ELDA Projects
- Evaluation Campaigns
New ELRA Board
The Board Members and Officers have been elected.
Following the 2023-2025 Board elections held early this year, the ELRA Board has been largely renewed.
Khalid Choukri, ELRA Secretary General, is pleased to introduce the new ELRA Board and would like to congratulate all the newly elected Officers and Members.
- Simon Krek, Jožef Stefan Institute, President (new)
- Teresa Lynn, MBZUAI, Vice-President (new)
- Francesca Frontini, ILC-CNR, Secretary (new)
- Aline Villavicencio, Sheffield University, Treasurer (new)
- Nancy Ide, Vassar College (new)
- Petya Osenova, IICT-BAS
- Patrick Paroubek, LISN (new)
- German Rigau, HITZ
- Sakriani Sakti, JAIST (new)
Patrick Paroubek and Sakriani Sakti attended the meeting remotely and therefore do not appear in the picture.
Honorary Presidents and Secretary General
- António Branco, FCUL, Honorary President (new)
- Nicoletta Calzolari, ILC-CNR, Honorary President
- Joseph Mariani, LIMSI-CNRS, Honorary President
- Khalid Choukri, ELDA CEO, ELRA Secretary General
Thank you to the outgoing Board Members and Officers for their continuous involvement over the years:
- Simonetta Montemagni (former ELRA Vice-President)
- Gilles Adda
- Nuria Bel
- Tatjana Gornostaja
- Piek Vossen
For more details on the current Board members, please visit the ELRA Board page @ https://bit.ly/2CH9hYU
LRs in the ELRA Catalogue this month
We are happy to announce that, since January 2023, 2 new speech corpora, 1 new lexicon and 2 new bilingual terminological resources have been added to our catalogue.
The MGB-5 Moroccan Dialect comprises 14 hours of Moroccan Arabic speech extracted from 93 YouTube videos distributed across seven genres: comedy, cooking, family/children, fashion, drama, sports, and science clips. The 93 YouTube clips have been manually labelled for speech and non-speech segments. About 12 minutes from each program were selected for transcription. In addition to the transcribed 14 hours, the full programs are also provided, which amount to 48 hours for the 93 programs.
Chinese-Vietnamese PhraseBank with audio files of daily conversations spoken by native speakers, containing 4002 sentence pairs. The scripts include Pinyin, Topic, Cat and Vietnamese translation, with corresponding audio in Chinese and Vietnamese. The corpus is provided in XML and WAV formats.
Manual translation of the 2.1 version of the English WordNet into Vietnamese containing 211000 entries, in Excel format.
Idioms French-Vietnamese Dictionary of 448 entries, with French terms translated into Vietnamese and one idiomatic sentence per Vietnamese word, provided in XML format.
Vietnamese Etymology Dictionary of 3100 entries, containing Vietnamese terms with correspondence in Kanji + Exp, with meaning and examples, provided in XML format.
The International Standard Language Resource Number (ISLRN) provides Language Resources (LRs) with unique identifiers using a standardised nomenclature. This aims to ensure that LRs are correctly identified and, consequently, recognised with proper references for their usage in applications in R&D projects, product evaluation and benchmarking, as well as in documents and scientific papers.
- 11 new ISLRN numbers assigned between January and March 2023.
- A total of 3353 ISLRN numbers assigned since January 2014.
- A total of 270 distinct languages.
The latest LRs for which an ISLRN number was requested and accepted are as follows:
- 2019 NIST Speaker Recognition Evaluation Test Set -- Audio-Visual - ISLRN: 470-750-139-731-6
- LORELEI Tagalog Representative Language Pack - ISLRN: 119-682-138-197-9
- Broken plural list - ISLRN: 340-952-913-841-9
- CLEF-TREC Q/A - ISLRN: 680-984-485-076-7
- LORELEI Tamil Representative Language Pack - ISLRN: 143-652-360-602-5
- Mixer 3 Speech - ISLRN: 823-474-406-019-9
More about ISLRN.
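For readers who want to spot ISLRNs in their own documents or reference lists, here is a minimal Python sketch. It assumes only the shape visible in the identifiers listed above (four groups of three digits followed by a single digit, separated by hyphens). The function names are illustrative and not part of any ISLRN tool, and the check says nothing about whether an identifier has actually been assigned.

```python
import re

# Assumed shape, inferred only from the identifiers listed above:
# four groups of three digits plus one final digit, joined by hyphens.
ISLRN_SHAPE = r"\d{3}-\d{3}-\d{3}-\d{3}-\d"


def looks_like_islrn(candidate: str) -> bool:
    """Return True if the whole string has the assumed ISLRN shape."""
    return re.fullmatch(ISLRN_SHAPE, candidate.strip()) is not None


def find_islrns(text: str) -> list[str]:
    """Return every substring of `text` that has the assumed ISLRN shape."""
    return re.findall(r"\b" + ISLRN_SHAPE + r"\b", text)


if __name__ == "__main__":
    line = "Broken plural list - ISLRN: 340-952-913-841-9"
    print(find_islrns(line))
    print(looks_like_islrn("470-750-139-731-6"))
```

This is a shape check only: a string can pass it without being a registered ISLRN, so the ISLRN portal remains the reference for any real lookup.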
The Italian GDPR authority bans ChatGPT over privacy concerns
On March 31, 2023, the Italian GDPR authority imposed an immediate limitation on the processing of Italian users' data by ChatGPT. The ban was imposed after the application experienced a data breach affecting users' conversations and payment data.
In its decision, the Italian authority highlighted that there was no legal basis for the collection and processing of personal data for the purpose of training ChatGPT. Moreover, its assessment found that some information made available by ChatGPT did not match factual circumstances.
Finally, the Italian GDPR authority also emphasised the absence of any age verification procedure, which can expose minors to inappropriate answers.
The Italian authority gave OpenAI 20 days to comply with this decision and explain how the above issues would be addressed. Failing to do so, OpenAI could be fined up to 20 million € or 4% of its total annual revenue.
Press release of the Italian GDPR supervisory authority available here.
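To make the two figures of the possible fine concrete, here is a small Python sketch. It assumes the rule of Article 83(5) GDPR, under which the ceiling is the higher of the two amounts. That reading is not spelled out in the text above, and the revenue figures below are purely hypothetical, not OpenAI's.

```python
FIXED_CEILING_EUR = 20_000_000   # the 20 million euro figure
REVENUE_SHARE_PERCENT = 4        # the 4% figure


def gdpr_fine_ceiling(annual_revenue_eur: float) -> float:
    """Upper bound of the fine, assuming the 'whichever is higher' rule."""
    revenue_based = annual_revenue_eur * REVENUE_SHARE_PERCENT / 100
    return max(FIXED_CEILING_EUR, revenue_based)


if __name__ == "__main__":
    # Hypothetical revenues, for illustration only.
    for revenue in (100_000_000, 500_000_000, 1_000_000_000):
        print(revenue, gdpr_fine_ceiling(revenue))
```

Under this assumption the revenue-based figure only becomes the relevant ceiling once annual revenue exceeds 500 million €. Below that threshold the fixed 20 million € applies.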
Members of the European Parliament call for a renegotiation of the EU-US Data Privacy Framework
Following the Opinion published by the European Data Protection Board on the adequacy of the EU-US Data Privacy Framework, Members of the European Parliament are in the process of adopting a motion asking the Commission not to adopt the adequacy decision unless the recommendations of the EDPB are fully implemented.
The motion is mainly concerned with the fact that executive orders can be amended or revoked at any time. In addition, it points out that intelligence activities carried out under the new EU-US Data Privacy Framework can be amended with no obligation to inform anyone and with no proportionality assessment carried out on the basis of EU principles.
The Members of the European Parliament also point out that the ad hoc Review Court instituted under the Privacy Framework would issue classified decisions, making it difficult to obtain clarity on the interpretation of legal concepts contained in the GDPR. Moreover, the resolution notes that the remedies available to companies are insufficient.
Finally, the motion regrets that the Executive Order does not prohibit the bulk collection of data, which would allow for mass surveillance.
Full text of the motion available here.
New Guidelines published by the European Data Protection Board
The European Data Protection Board released 3 sets of updated Guidelines in April 2023 on the implementation and understanding of some provisions of the GDPR, as follows:
Guidelines on the identification of a lead supervisory authority. These Guidelines are useful in the event of cross-national processing or where processors and controllers are located in different European countries (available here).
Guidelines on data subject rights. These Guidelines elaborate on the way users' rights can be enforced, especially the right of access to personal data pursuant to Article 15 of the GDPR (available here).
Guidelines on personal data breach notification. The GDPR provides that, in the event of a personal data breach, the controller must notify the supervisory authority of the said breach. These Guidelines advise controllers on the timeline, content and procedure to follow when reporting personal data breaches, and on how to mitigate the risks (available here).
Information on the on-going projects
Common European Language Data Space (LDS)
- German Research Center for Artificial Intelligence (DFKI) (coordinator),
- Evaluations and Language Resources Distribution Agency (ELDA),
- Athena Research and Innovation Center in Information, Communication and Knowledge Technologies (ILSP),
- SIA Tilde.
- A revised list of all personal data processing activities foreseen in the original Technical Offer.
- A first list of activities linked with existing Data Protection Records provided by the European Commission.
- The main structure and definitions of all subtasks that will be addressed in the Data Protection Concept document at later stages (Data Processing Impact Assessment, Data Protection Risk Assessment, LDS Infrastructure, Data Protection Notices, etc.).
Language Technology Solutions - CNECT/LUX/2022/OP/0030
This call for tenders from the European Commission was published within the Digital Europe programme (DIGITAL). It aims to achieve three specific goals: 1. facilitate uptake by SMEs, NGOs, public administration, and academia of European machine translation services for websites; 2. support the creation of open-source European language speech recognition solutions; 3. carry out market studies on language technologies and widely disseminate their results to foster the take-up of language technologies in Europe.
ELRA, through its operational body ELDA, is involved in two of the funded projects which are described below.
LOT 1 - Solutions Supporting the Use of Automated Translations on Websites
The project was officially launched on December 12, 2022 under the name “European Multilingual Web (EMW)”. The EMW consortium is coordinated by Tilde (Latvia) with the participation of ELDA (Evaluations and Language Resources Distribution Agency, France), IDC (International Data Corporation), Ogilvy (SIA Guilty, Latvia) and Rīga Stradiņš University (Latvia).
It involves four major tasks:
Task 1: carrying out a comprehensive and evidence-based market study on the multilingualism of websites.
Task 2: delivering a set of ready-to-use open-source automated website translation solutions, and providing their subsequent maintenance and support (including a helpdesk), with regular updating of relevant documentation, as required. ELDA will be responsible for running the helpdesk, which will be set up by Month 9 of the project (September 2023).
Task 3: publishing the set of open-source automated website translation solutions developed during Task 2 on a dedicated solutions website, achieving widespread use of the solutions through promotional activities, and building awareness of EU actions to support and nurture multilingualism.
Task 4: developing and implementing the strategy to ensure the sustainability, after the end of the contract, of the set of ready-to-use open-source automated website translation solutions developed or supported under Task 2.
LOT 2 – Language Technologies Solutions
The project was officially launched on December 13, 2022.
The consortium is coordinated by Brno University of Technology (BUT) with the participation of TILDE and ELDA. All members of the consortium will contribute to three main tasks:
Task 1: A comprehensive market study of Automatic Speech Recognition (ASR) solutions. This includes an overview of the main stakeholders and techniques of the domain as well as the availability of speech and related transcription data for ASR. This task is mainly carried out by ELDA. In March 2023, a first draft of the study was produced and submitted to the European Commission (EC).
Task 2: Creation of an open-source basic speech recognition prototype solution. This task is mainly carried out by BUT and TILDE, with ELDA taking an advisory position. As for Task 1, a first deliverable describing the solution's documentation and key features was produced and submitted to the European Commission (EC).
Task 3: Collection and partial transcription (one third) of speech data for three under-resourced European languages. A total of 4500 hours per language will be packaged under the responsibility and coordination of ELDA. The data will be used for training the solution developed in Task 2 as well as to constitute three corpora that will be delivered to the EC. In March 2023, a first legal analysis was sent to the EC with basic information about the collection and transcription timeline.
Evaluation Campaigns
- IWSLT 2023 Evaluation Campaign - https://iwslt.org/2023/
- SemEval-2023 - The 17th International Workshop on Semantic Evaluation - https://semeval.github.io/SemEval2023/
- VarDial Evaluation Campaign 2023 - https://sites.google.com/view/vardial-2023/shared-tasks
- HaSpeeDe 3 (Hate Speech Detection) shared task within Evalita 2023 - http://www.di.unito.it/~tutreeb/haspeede-evalita23/
News from ELRA
Lingotto Conference Center in Turin (Italy)
May 20-25, 2024
Two major international key players in the area of computational linguistics, the ELRA Language Resources Association (ELRA) and the International Committee on Computational Linguistics (ICCL), are joining forces to organize the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024) to be held in Turin (Italy) on 20-25 May, 2024.
The hybrid conference will bring together researchers and practitioners in computational linguistics, speech, multimodality, and natural language processing, with special attention to evaluation and the development of resources that support work in these areas. Following in the tradition of the well-established parent conferences COLING and LREC, the joint conference will feature grand challenges and provide ample opportunity for attendees to exchange information and ideas through both oral presentations and extensive poster sessions, complemented by a friendly social program.
In addition to the three-day main conference, workshops and tutorials will be held before and after the conference.
Conference website: https://lrec-coling-2024.lrec-conf.org/
Language Resources and Evaluation Journal
Language Resources and Evaluation is the first publication devoted to the acquisition, creation, annotation, and use of language resources, together with methods for evaluation of resources, technologies, and applications.
Issue 1, March 2023
Special Section: LREC 2020: Selected Papers (1-188)
News from the community
In memoriam, Chris Cieri (1963-2023)
The ELRA Board, the LREC and ELDA team members are deeply saddened to announce that Chris Cieri, the Executive Director of the Linguistic Data Consortium (LDC), passed away on March 25, 2023.
ELRA and LDC have long been strong partners, sharing similar missions to distribute Language Resources and promote research and development of Language Technologies. Numerous collaborations have taken place between the two entities, among them NETDC production/distribution, CONLL joint distribution, participation in NEMLAR/MEDAR, and lately ISLRN. The proper citation of Language Resources in papers has long been discussed within the field, including during the LDC-ELDA meeting in Paris in 2011. In November 2013, at the NLP12 meeting, the representatives of major Natural Language Processing and Computational Linguistics organizations met with the objective of coordinating their activities within the field. Mark Liberman and Chris Cieri attended this meeting, where the ISLRN was established by ELRA, LDC and AFNLP/Oriental COCOSDA. Up to now, ELDA and LDC have shared the moderation/attribution of ISLRN.
Chris Cieri took part in all the strategic workshops ELRA has organized in the past 15 years, including those in Valletta, Paris, Florence, Dubrovnik and Lucca.
For a couple of editions, he served as an LREC Programme Committee member and was one of the few who attended all the LREC editions, including the last one in Marseille, in June 2022.
Throughout the years, Chris has been a colleague of many and became a friend for some. He was respectful of the opinions of others, he was kind and caring, and everyone praised his constructive approach. He was also what the French call a "bon vivant", meaning a nice guy enjoying life, good company, good food, good chocolate and good wine...
Our sincere condolences go to his wife Mimi and his daughter Caitlin.
We will miss you dearly, Chris.
In Memoriam, Professor Yorick Wilks (1939-2023)
Professor Yorick Wilks, founder of the Natural Language Processing Research Group at Sheffield University and AI pioneer, passed away on April 14, 2023 at the age of 83.
In 2008, the ELRA Board awarded Yorick Wilks the Antonio Zampolli Prize at LREC in Marrakech. He gave a presentation entitled "In my beginning is my end: reflections on 45 years of NLP and corpora" that is available here: https://bit.ly/3UQOE2e
His obituary can be read here: https://www.oii.ox.ac.uk/news-...
Rest In Peace Professor Wilks