Automatic measurement of distances between languages using Swadesh lists and big text corpora
General information
Venue: Cidade da Cultura, Edifício Fontán, Room 10, 12 March, from 9:30 to 13:00.
For USC students, registration for the tutorials is free (even the coffee break 😉). You can register by sending the name of the tutorial or tutorials you want to attend to this email: <propor2024@gmail.com>. The conference organisers are offering free shuttle buses:
Official PROPOR shuttles: The organisation will provide shuttles to the conference in the morning and back at the end of each day. The buses will leave from the Hotel Exe Peregrino and will stop at Praza de Galicia, at Porta do Camiño, and at the hotels in the San Lázaro area (Hotel Eurostars San Lázaro and Hotel Puerta del Camino).
Requirements
A laptop is required to participate in the tutorial (the Cog software, which runs on Windows, will be used in one part of the tutorial: https://software.sil.org/cog/).
Description
The measurement of language distances is a crucial task both for formulating phylogenetic models of language diversification and for important NLP tasks such as automatic language identification and translation. In this tutorial, the instructors will showcase two different methods for accomplishing this task. The first method starts from phonetically transcribed wordlists of about 200 words that comprise universal, i.e., culture-independent, lexical items [1]. We will demonstrate how to clean these data and normalize them following the IPA Unicode conventions [2]. We then use tools within the COG software [3] to audit cognate alignments through phonetic distance measurements. Finally, we visualize the resulting graphs within COG (dendrograms and NeighborNets) and show further uses of the word-by-word distances for the detection of language correlations, for instance by organizing the items by semantic field. For this method, training wordlists will be provided.
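COG implements its own alignment and distance algorithms, so the following is not its method; it is a minimal sketch of the underlying idea — an average normalized Levenshtein distance between Unicode-normalized IPA transcriptions of matched Swadesh items. The three-item wordlists are invented illustrations, not curated data:

```python
import unicodedata

def normalize_ipa(s: str) -> str:
    """NFD normalization so precomposed and combining-diacritic
    spellings of the same IPA symbol compare equal."""
    return unicodedata.normalize("NFD", s.strip())

def edit_distance(a: str, b: str) -> int:
    """Classic Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def wordlist_distance(list_a: dict, list_b: dict) -> float:
    """Average normalized edit distance over the concepts shared by
    two {concept: IPA transcription} wordlists."""
    shared = list_a.keys() & list_b.keys()
    total = 0.0
    for concept in shared:
        a = normalize_ipa(list_a[concept])
        b = normalize_ipa(list_b[concept])
        total += edit_distance(a, b) / max(len(a), len(b))
    return total / len(shared)

# Toy three-item "Swadesh" fragments (illustrative transcriptions only)
portuguese = {"water": "ˈaɡwɐ", "stone": "ˈpɛdɾɐ", "fire": "ˈfoɡu"}
spanish    = {"water": "ˈaɡwa", "stone": "ˈpjedɾa", "fire": "ˈfweɣo"}

print(f"pt-es distance: {wordlist_distance(portuguese, spanish):.3f}")
```

A distance of 0 means identical transcriptions for every shared concept; values approach 1 as the transcriptions diverge completely.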
The second method extracts text from Wikimedia projects and analyses it using n-grams [4]. This model uses orthographic corpora, which are divided into training and test datasets; it requires around 1 million words per language to produce highly reliable results. Because orthographic n-grams also capture syntactic and morphological information, this method goes beyond the phonetic distance algorithm above. Easily available corpora can be found at OPUS (opus.nlpl.eu), but we will also introduce the use of a web scraper to collect large corpora.
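The thesis in [4] develops its own corpus-based measures; as a much-simplified sketch of the general n-gram idea, the example below swaps in the classic Cavnar–Trenkle "out-of-place" rank distance over character trigram profiles. The toy corpora are invented, and real use would need the roughly 1 million words mentioned above:

```python
from collections import Counter

def ngram_profile(text: str, n: int = 3, top: int = 300) -> list:
    """Rank the most frequent character n-grams of a corpus sample."""
    grams = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    return [g for g, _ in grams.most_common(top)]

def out_of_place(profile_a: list, profile_b: list) -> int:
    """Cavnar-Trenkle 'out-of-place' distance: sum of rank differences,
    with a maximum penalty for n-grams absent from the other profile."""
    rank_b = {g: r for r, g in enumerate(profile_b)}
    penalty = len(profile_b)
    return sum(abs(r - rank_b.get(g, penalty)) for r, g in enumerate(profile_a))

# Tiny illustrative "training corpora" (invented sentences)
train = {
    "pt": "o galego e o português partilham a mesma origem medieval",
    "es": "el gallego y el portugués comparten el mismo origen medieval",
}
test_text = "as línguas da península têm uma história comum"

profiles = {lang: ngram_profile(t) for lang, t in train.items()}
test_profile = ngram_profile(test_text)
for lang, prof in profiles.items():
    print(lang, out_of_place(test_profile, prof))
```

The same profile comparison doubles as a language identifier (assign the test text to the nearest training profile) and, applied between the training profiles themselves, yields a pairwise distance matrix over languages.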
Bibliography
[1] Swadesh, Morris. 1955. Towards greater accuracy in lexicostatistic dating. International Journal of American Linguistics 21(2). 121–137.
[2] Moran, Steven & Cysouw, Michael. 2018. The Unicode cookbook for linguists: Managing writing systems using orthography profiles. Berlin: Language Science Press.
[3] Daspit, Damian. 2015. Cog software. https://software.sil.org/cog/
[4] Pichel Campos, J. R. 2020. Medidas de distância entre línguas baseadas em corpus: aplicação à linguística histórica do galego, português, espanhol e inglês. Doctoral dissertation.
Syllabus
Introduction to language distance models (with some examples of Iberian Languages and their variation)
Automatic language distance measurement using Swadesh lists
- Data cleaning and normalization
- Introduction and use of COG software
- Data visualization and further uses
Coffee break
Automatic language distance measurement using text corpora
- Data extraction
- Division into test and training data
- Visualization
Questions and Answers
Tutorial Instructors
Carlos Silva (CLUP | FLUP): cssilva@letras.up.pt | https://silvaphon.wordpress.com/
Dates
- Workshop day: 12/03/2024