5th OpenCor: Latin American and Iberian Languages Open Corpora Forum

Full program

14h30 Opening

  • Presenter: Livy Real

14h40 Training Large Language Encoders with the Curated Carolina Corpus

  • Presenter: Paulo Cavalin
  • Authors: Guilherme Mello, Paulo Cavalin, Felipe Ribas Serras, Marcelo Finger, Pedro Domingues, Miguel de Mello Carpi and Marcos Jose

15h Carolina: a dual purpose corpus of contemporary Portuguese under continuous development

  • Presenter: Marcelo Finger
  • Authors: Felipe Ribas Serras, Mariana Sturzeneker, Maria Clara Crespo, Mayara Feliciano Palma, Miguel de Mello Carpi, Aline Silva Costa, Guilherme Lamartine de Mello, Vanessa Monte, Cristiane Namiuti, Maria Clara Paixão De Sousa and Marcelo Finger

15h20 Overview of Latin American and Iberian corpora in Sketch Engine

  • Presenter: František Kovařík
  • Author: František Kovařík

15h40 Factive Verbs in Portuguese

  • Presenter: Livy Real
  • Authors: Valeria de Paiva and Livy Real

16h Coffee Break

16h30 Rodaviva Corpus:

  • Presenter: Oto Vale
  • Authors: Gabriela Wick-Pedro, Isaac Souza de Miranda Jr and Oto Vale

16h50 DOXCOR-br: Um corpus de discurso de ódio xenofóbico para o português brasileiro

  • Presenter: Amanda Oliveira
  • Authors: Amanda Oliveira, Eduardo Luz and Maxilene Faria

17h10 RePro: A Benchmark Dataset for Opinion Mining in Brazilian Portuguese

  • Presenter: Karina Soares
  • Authors: Lucas Nildaimon dos Santos Silva, Ana Claudia Bianchini Zandavalle, Carolina Francisco Gadelha Rodrigues, Tatiana da Silva Gama, Fernando Guedes Souza, Phillipe Derwich Silva Zaidan, Alice Florencio Severino da Silva, Karina Soares and Livy Real  


This is the fifth edition of OpenCor, a venue that aims to gather the community working on freely available language resources for the large variety of languages spoken in Iberian countries and in Latin America, including Portuguese and Galician. Recent years have seen a move in Computational Linguistics towards bigger, better, and more reliably annotated corpora. However, the scarcity of such reliably annotated corpora remains one of the major bottlenecks in natural language processing. Producing and maintaining corpora is a hard task that usually requires sizable funding and the cooperation of several experts. Although the need for such corpora is clear, the difficulties and the amount of work involved make producing and releasing this data a non-trivial undertaking.

While “big data” is a trend, producing reliable corpora remains largely invisible work in Natural Language Processing. Especially when working on languages other than English, on smaller datasets not immediately suitable for machine learning approaches, or on a new release of an existing dataset, it is not obvious to corpus creators where to publish and properly discuss their work. Most of the major Natural Language Processing venues do not accept corpus descriptions. The situation is even worse for minority and endangered languages, since most of them have no dedicated venue where this work can be discussed.

The Latin American and Iberian communities that produce open corpora do not have an established event where experts can share ideas, discuss difficulties, and get feedback on their work. Various meetings have been held in recent years, but either they were not general enough to embrace all the corpus work done in these communities, or they lacked continuity and support for future editions. As a result, groups that share related interests or face the same difficulties are often unaware of one another and of recent work within these communities.

This forum aims both to fill the gap left by the lack of a permanent venue for the construction, annotation, and maintenance of open corpora for Latin American and Iberian languages and to create an extensive list of these resources. OpenCor welcomes discussions on Portuguese/Galician, Spanish, indigenous languages, creoles, Catalan, Aragonese, Astur-Leonese, Aranese, and any other language spoken in Latin America and Iberian countries. Work on endangered, minority, and/or less-resourced languages is particularly welcome. To encourage the gathering of this community, including corpus maintainers, OpenCor does not necessarily ask for original research papers, but for discussions of open data.

Important Dates

  • First Call for papers: December 7
  • Second Call for papers: January 2
  • Hard Deadline for submissions: January 17
  • Acceptance: February 1
  • Session: March 12

For further information, please visit https://opencor.gitlab.io/