Automatic measurement of distances between languages using Swadesh lists and big text corpora

Description

Measuring language distances is a crucial task both for building phylogenetic models of language diversification and for NLP tasks such as automatic language detection and translation. In this tutorial, the instructors will showcase two methods for accomplishing this task. The first uses phonetically transcribed wordlists of about 200 universal, i.e., culture-independent, lexical items [1]. We will demonstrate how to clean this data and normalize it following IPA Unicode conventions [2]. Then, we will use tools within the COG software [3] to audit cognate alignment through phonetic distance measurements. Finally, we will visualize the resulting graphs within COG (dendrograms and NeighborNets) and show further uses of the word-by-word distances for detecting correlations between languages, for instance by organizing the items by semantic field. Training wordlists will be provided for this method.
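To give a rough idea of what normalization and a word-by-word distance involve, here is a minimal Python sketch. It is not a reimplementation of COG: it assumes that Unicode NFC normalization is an acceptable cleaning step and swaps in a plain length-normalized Levenshtein distance for the phonetic alignment. The function names and the toy wordlists are hypothetical and serve only as illustration.

```python
import unicodedata


def normalize_ipa(form: str) -> str:
    """Trim whitespace and bring the transcription into Unicode NFC form."""
    return unicodedata.normalize("NFC", form.strip())


def levenshtein(a: str, b: str) -> int:
    """Edit distance (insertions, deletions, substitutions) between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def word_distance(a: str, b: str) -> float:
    """Edit distance divided by the longer form, so values fall in [0, 1].

    Simplification: every code point counts as one segment, so a combining
    diacritic that NFC cannot compose is treated as a segment of its own.
    """
    a, b = normalize_ipa(a), normalize_ipa(b)
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


def list_distance(list_a: dict, list_b: dict) -> float:
    """Mean word distance over the concepts present in both wordlists."""
    shared = sorted(set(list_a) & set(list_b))
    if not shared:
        raise ValueError("the two wordlists share no concepts")
    return sum(word_distance(list_a[c], list_b[c]) for c in shared) / len(shared)


# Toy wordlists (made-up forms, not real language data).
variety_a = {"water": "pata", "hand": "mano"}
variety_b = {"water": "bata", "hand": "mano"}
print(list_distance(variety_a, variety_b))
```

Keeping the per-concept values from word_distance, rather than only the mean, is what makes it possible to regroup the items afterwards, for instance by semantic field.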

The second method extracts text from Wikimedia projects and analyses it using n-grams [4]. This model uses orthographic corpora, which are divided into training and test datasets. It requires around 1 million words to produce highly reliable results. By involving syntactic and morphological information, this method goes beyond the phonetic distance algorithm above. Ready-made corpora are available at opus.nlpl, but we will also introduce the use of a web scraper to collect large corpora.
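One common way to turn n-gram models into a distance, sketched below in Python, is to train a character n-gram model on text in one language and measure its perplexity on held-out text in another: the higher the perplexity, the further apart the two are. This is an illustrative sketch under our own assumptions (character trigrams, add-one smoothing, hypothetical function names), not necessarily the exact setup presented in the tutorial or in [4].

```python
import math
from collections import Counter


def char_ngrams(text: str, n: int) -> list:
    """Character n-grams of the text, left-padded with spaces."""
    padded = " " * (n - 1) + text
    return [padded[i:i + n] for i in range(len(text))]


def train(text: str, n: int = 3) -> dict:
    """Count n-grams and their (n-1)-character contexts in the training text."""
    grams = char_ngrams(text, n)
    return {
        "n": n,
        "ngrams": Counter(grams),
        "contexts": Counter(g[:-1] for g in grams),
        # +1 reserves probability mass for characters unseen in training.
        "vocab_size": len(set(text)) + 1,
    }


def perplexity(model: dict, text: str) -> float:
    """Perplexity of the text under the model, with add-one smoothing."""
    grams = char_ngrams(text, model["n"])
    if not grams:
        raise ValueError("test text is empty")
    log_prob = 0.0
    for g in grams:
        numerator = model["ngrams"][g] + 1
        denominator = model["contexts"][g[:-1]] + model["vocab_size"]
        log_prob += math.log(numerator / denominator)
    return math.exp(-log_prob / len(grams))


def split_corpus(text: str, train_share: float = 0.8) -> tuple:
    """Split a corpus into training and test portions by character position."""
    cut = int(len(text) * train_share)
    return text[:cut], text[cut:]


# Toy strings standing in for corpora of around 1 million words.
model = train("abababab")
print(perplexity(model, "abababab"))  # text resembling the training data
print(perplexity(model, "cdcdcdcd"))  # text unlike the training data
```

Perplexity is not symmetric, so a single distance value for a language pair could be obtained, for example, by averaging the two directions.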

Bibliography

[1] Swadesh, Morris. 1955. Towards greater accuracy in lexicostatistic dating. International Journal of American Linguistics 21(2). 121–137.

[2] Moran, Steven & Cysouw, Michael. 2018. The Unicode cookbook for linguists: Managing writing systems using orthography profiles. Berlin: Language Science Press.

[3] Daspit, Damian. 2015. Cog software. https://software.sil.org/cog/

[4] Pichel Campos, J. R. 2020. Medidas de distância entre línguas baseadas em corpus. Aplicação à linguística histórica do galego, português, espanhol e inglês. Doctoral thesis.

Syllabus

Introduction to language distance models (with some examples of Iberian Languages and their variation)

Automatic language distance measurement using Swadesh lists

  • Data cleaning and normalization
  • Introduction and use of COG software
  • Data visualization and further usage

Coffee break

Automatic language distance measurement using text corpora

  • Data extraction
  • Division into test and training data
  • Visualization

Questions and Answers

Tutorial Instructors

Carlos Silva (CLUP | FLUP): cssilva@letras.up.pt | https://silvaphon.wordpress.com/

José Ramom Pichel (Proxecto Nós | CiTIUS – U. Santiago de Compostela): jramon.pichel@usc.es

Fábio Granja (FLUP): up202100212@letras.up.pt | https://www.linkedin.com/in/fábio-barcellos-granja-3511371a2/

Dates

  • Workshop day: 12/03/2024