{"id":1241,"date":"2024-02-06T20:22:02","date_gmt":"2024-02-06T19:22:02","guid":{"rendered":"https:\/\/propor2024.citius.gal\/?page_id=1241"},"modified":"2024-03-08T10:44:13","modified_gmt":"2024-03-08T09:44:13","slug":"automatic-measurement-of-distances-between-languages-using-swadesh-lists-and-big-text-corpora","status":"publish","type":"page","link":"https:\/\/propor2024.citius.gal\/index.php\/automatic-measurement-of-distances-between-languages-using-swadesh-lists-and-big-text-corpora\/","title":{"rendered":"Automatic measurement of distances between languages using Swadesh lists and big text corpora"},"content":{"rendered":"<p>[et_pb_section fb_built=&#8221;1&#8243; _builder_version=&#8221;4.20.2&#8243; _module_preset=&#8221;default&#8221; background_image=&#8221;https:\/\/propor2024.citius.gal\/wp-content\/uploads\/2023\/03\/fotoSantiagoCatedral.jpg&#8221; height=&#8221;500px&#8221; custom_padding=&#8221;0px||0px||false|false&#8221; custom_css_main_element=&#8221;position:relative;&#8221; global_colors_info=&#8221;{}&#8221;][et_pb_row _builder_version=&#8221;4.20.2&#8243; _module_preset=&#8221;default&#8221; width=&#8221;100%&#8221; max_width=&#8221;100%&#8221; height=&#8221;501px&#8221; custom_margin=&#8221;|auto|24px|auto||&#8221; custom_padding=&#8221;0px||0px||false|false&#8221; custom_css_main_element=&#8221;display:flex;||flex-wrap:wrap;||align-content:flex-end;&#8221; global_colors_info=&#8221;{}&#8221;][et_pb_column type=&#8221;4_4&#8243; _builder_version=&#8221;4.20.2&#8243; _module_preset=&#8221;default&#8221; global_colors_info=&#8221;{}&#8221;][et_pb_text module_id=&#8221;textHome&#8221; module_class=&#8221;textTitleName&#8221; _builder_version=&#8221;4.20.2&#8243; _module_preset=&#8221;default&#8221; text_font=&#8221;|700|||||||&#8221; text_text_color=&#8221;#050505&#8243; text_font_size=&#8221;32px&#8221; background_color=&#8221;#FFFFFF&#8221; text_orientation=&#8221;center&#8221; width=&#8221;100%&#8221; height=&#8221;100px&#8221; 
custom_margin=&#8221;|-298px|||false|false&#8221; custom_padding=&#8221;||20px||false|false&#8221; custom_css_main_element=&#8221;position:relative;||display:flex;||flex-wrap:wrap;||align-content:center;||justify-content:center;||||&#8221; border_radii=&#8221;off||||&#8221; global_colors_info=&#8221;{}&#8221; custom_css_main_element_last_edited=&#8221;on|phone&#8221; custom_css_main_element_tablet=&#8221;position:relative;||display:flex;||flex-wrap:wrap;||align-content:center;||justify-content:center;||top:1px;&#8221; custom_css_main_element_phone=&#8221;position:relative;||display:flex;||flex-wrap:wrap;||align-content:center;||justify-content:center;||top:1px;&#8221;]<\/p>\n<p align=\"left\">Automatic measurement of distances between languages using Swadesh lists and big text corpora<\/p>\n<p>[\/et_pb_text][\/et_pb_column][\/et_pb_row][\/et_pb_section][et_pb_section fb_built=&#8221;1&#8243; _builder_version=&#8221;4.20.2&#8243; _module_preset=&#8221;default&#8221; custom_padding=&#8221;25px||25px||true|false&#8221; global_colors_info=&#8221;{}&#8221;][et_pb_row _builder_version=&#8221;4.20.2&#8243; _module_preset=&#8221;default&#8221; global_colors_info=&#8221;{}&#8221;][et_pb_column type=&#8221;4_4&#8243; _builder_version=&#8221;4.20.2&#8243; _module_preset=&#8221;default&#8221; global_colors_info=&#8221;{}&#8221;][et_pb_text _builder_version=&#8221;4.20.2&#8243; _module_preset=&#8221;default&#8221; text_text_color=&#8221;#252525&#8243; global_colors_info=&#8221;{}&#8221;]<\/p>\n<p><strong>General information<\/strong><\/p>\n<p>Venue: <span style=\"font-size: 14px;\">Cidade da Cultura, Edif\u00edcio Font\u00e1n, Room 10, on 12 March 2024, from 9:30 to 13:00.<\/span><\/p>\n<p>For USC students, registration for the tutorials is free (coffee break included \ud83d\ude09). To register, send the name of each tutorial you want to attend to &lt;propor2024@gmail.com&gt;. 
The conference organisers are offering free shuttle buses:<\/p>\n<p>Official PROPOR shuttles: The organisation will provide shuttles to the conference each morning and back at the end of each day. The buses will leave from Hotel Exe Peregrino and will stop at Praza de Galicia, at Porta do Cami\u00f1o, and at the hotels in the San L\u00e1zaro area (Hotel Eurostars San L\u00e1zaro and Hotel Puerta del Camino).<\/p>\n<p align=\"left\"><strong>Requirements<\/strong><\/p>\n<p align=\"left\">A laptop is required to participate in the tutorial (the Cog software, running on Windows, will be used in one part of the tutorial: <a href=\"https:\/\/software.sil.org\/cog\/\">https:\/\/software.sil.org\/cog\/<\/a>).<\/p>\n<p align=\"left\"><strong>Description<\/strong><\/p>\n<p align=\"left\">Measuring language distances is a crucial task both for building phylogenetic models of language diversification and for important NLP tasks such as automatic language detection and translation. In this tutorial, the instructors will showcase two different methods for accomplishing this task. The first method takes phonetically transcribed wordlists of about 200 words that comprise universal, i.e., culture-independent, lexical items [1]. We will demonstrate how to clean this data and normalize it following IPA Unicode conventions [2]. Then, we use tools within the Cog software [3] to audit cognate alignment through phonetic distance measurements. Finally, we visualize the graphs within Cog (dendrograms and NeighborNets) and show further uses of the word-by-word distances for detecting correlations between languages, for instance, by organizing the items by semantic field. For this method, training wordlists will be provided.<\/p>\n<p align=\"left\">The second method extracts text from Wikimedia projects and analyses it using n-grams [4]. 
This model uses orthographic corpora, which are divided into training and test datasets. It requires around 1 million words to produce highly reliable results. By incorporating syntactic and morphological information, this method goes beyond the phonetic distance algorithm above. Readily available corpora can be found in the OPUS collection (opus.nlpl), but we will also introduce the use of a web scraper to collect large corpora.<\/p>\n<p align=\"left\"><strong>Bibliography<\/strong><\/p>\n<p align=\"left\">[1] Swadesh, Morris. 1955. Towards greater accuracy in lexicostatistic dating. International Journal of American Linguistics 21(2). 121\u2013137.<\/p>\n<p align=\"left\">[2] Moran, Steven &amp; Cysouw, Michael. 2018. The Unicode cookbook for linguists: Managing writing systems using orthography profiles. Berlin: Language Science Press.<\/p>\n<p align=\"left\">[3] Daspit, Damian. 2015. Cog software. <a href=\"https:\/\/software.sil.org\/cog\/\">https:\/\/software.sil.org\/cog\/<\/a><\/p>\n<p align=\"left\">[4] Pichel Campos, J. R. 2020. Tese &#8211; Medidas de dist\u00e2ncia entre l\u00ednguas baseadas em corpus. 
Aplica\u00e7\u00e3o \u00e0 lingu\u00edstica hist\u00f3rica do galego, portugu\u00eas, espanhol e ingl\u00eas.<\/p>\n<p align=\"left\"><strong>Syllabus<\/strong><\/p>\n<p>Introduction to language distance models (with some examples from Iberian languages and their variation)<\/p>\n<p><span style=\"font-size: 14px;\">Automatic language distance measurement using Swadesh lists<\/span><\/p>\n<ul>\n<li><span style=\"font-size: 14px;\">Data cleaning and normalization<\/span><\/li>\n<li><span style=\"font-size: 14px;\">Introduction to and use of the Cog software<\/span><\/li>\n<li><span style=\"font-size: 14px;\">Data visualization and further usage<\/span><\/li>\n<\/ul>\n<p align=\"left\">Coffee break<\/p>\n<p align=\"left\"><span style=\"font-size: 14px;\">Automatic language distance measurement using text corpora<\/span><\/p>\n<ul>\n<li align=\"left\"><span style=\"font-size: 14px;\">Data extraction<\/span><\/li>\n<li align=\"left\"><span style=\"font-size: 14px;\">Division into test and training data<\/span><\/li>\n<li align=\"left\"><span style=\"font-size: 14px;\">Visualization<\/span><\/li>\n<\/ul>\n<p align=\"left\"><span style=\"font-size: 14px;\">Questions and Answers<\/span><\/p>\n<p align=\"left\"><strong style=\"font-size: 14px;\">Tutorial Instructors<\/strong><\/p>\n<p align=\"left\">Carlos Silva (CLUP | FLUP): <a href=\"mailto:cssilva@letras.up.pt\">cssilva@letras.up.pt<\/a> | <a href=\"https:\/\/silvaphon.wordpress.com\/\">https:\/\/silvaphon.wordpress.com\/<\/a><\/p>\n<div class=\"bubbles-group\">\n<div data-mid=\"215079\" data-peer-id=\"606369983\" data-timestamp=\"1709722829\" class=\"bubble hide-name is-in can-have-tail is-group-first is-group-last\">\n<div class=\"bubble-content-wrapper\">\n<div class=\"bubble-content\">\n<div class=\"message spoilers-container\" dir=\"auto\">\n<p>Lu\u00eds Trigo (CODA\/CLUP): <a href=\"mailto:ltrigo@letras.up.pt\">ltrigo@letras.up.pt<\/a> | <a 
href=\"https:\/\/coda.letras.up.pt\/\">https:\/\/coda.letras.up.pt\/<\/a><\/p>\n<p><span style=\"font-size: 14px;\">Jos\u00e9 Ramom Pichel (Proxecto N\u00f3s | CiTIUS \u2013 U. Santiago de Compostela) <\/span><span style=\"font-size: 14px; color: #1155cc;\"><u><a href=\"mailto:jramon.pichel@usc.es\">jramon.pichel@usc.es<\/a><\/u><\/span><\/p>\n<p><span style=\"font-size: 14px;\">F\u00e1bio Granja (FLUP):\u00a0<\/span><a href=\"mailto:up202100212@letras.up.pt\" style=\"font-size: 14px;\"><span style=\"color: #1155cc;\"><u>up202100212@letras.up.pt<\/u><\/span><\/a><span style=\"font-size: 14px;\"> | <\/span><a href=\"https:\/\/www.linkedin.com\/in\/f\u00e1bio-barcellos-granja-3511371a2\/\" style=\"font-size: 14px;\">https:\/\/www.linkedin.com\/in\/f\u00e1bio-barcellos-granja-3511371a2\/<\/a><\/p>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<p align=\"left\"><strong style=\"font-size: 14px;\"><\/strong><\/p>\n<p align=\"left\"><strong style=\"font-size: 14px;\">Dates<\/strong><\/p>\n<section id=\"h.25f2ebfea28890a4_81\" class=\"yaqOZd\">\n<div class=\"mYVXT\">\n<div class=\"LS81yb VICjCf j5pSsc db35Fc\" tabindex=\"-1\">\n<div class=\"hJDwNd-AhqUyc-uQSCkd Ft7HRd-AhqUyc-uQSCkd purZT-AhqUyc-II5mzb ZcASvf-AhqUyc-II5mzb pSzOP-AhqUyc-qWD73c Ktthjf-AhqUyc-qWD73c JNdkSc SQVYQc\">\n<div class=\"JNdkSc-SmKAyb LkDMRd\">\n<div class=\"\" jscontroller=\"sGwD4d\" jsaction=\"zXBUYb:zTPCnb;zQF9Uc:Qxe3nd;\" jsname=\"F57UId\">\n<div class=\"oKdM2c ZZyype Kzv0Me\">\n<div id=\"h.25f2ebfea28890a4_78\" class=\"hJDwNd-AhqUyc-uQSCkd Ft7HRd-AhqUyc-uQSCkd jXK9ad D2fZ2 zu5uec OjCsFc dmUFtb wHaque g5GTcb\">\n<div class=\"jXK9ad-SmKAyb\">\n<div class=\"tyJCtd mGzaTb Depvyb baZpAe\">\n<ul class=\"n8H08c UVNKR \">\n<li dir=\"ltr\" class=\"zfr3Q TYR86d eD0Rn \"><span style=\"font-size: 14px;\">Workshop day: 12\/03\/2024 
<\/span><\/li>\n<\/ul>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<\/section>\n<p>[\/et_pb_text][\/et_pb_column][\/et_pb_row][\/et_pb_section]<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Automatic measurement of distances between languages using Swadesh lists and big text corpora General information Venue: Cidade da Cultura, Edif\u00edcio Font\u00e1n, Room 10, day 12, from 9.30h to 13h. For USC students, the registration to the tutorials is free (even the coffee break \ud83d\ude09). You can register by sending the name of the tutorial or [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"parent":0,"menu_order":0,"comment_status":"closed","ping_status":"closed","template":"","meta":{"_et_pb_use_builder":"on","_et_pb_old_content":"","_et_gb_content_width":""},"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v20.4 - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>Automatic measurement of distances between languages using Swadesh lists and big text corpora - PROPOR 2024 - Universidade de Santiago de Compostela<\/title>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/propor2024.citius.gal\/index.php\/automatic-measurement-of-distances-between-languages-using-swadesh-lists-and-big-text-corpora\/\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:title\" content=\"Automatic measurement of distances between languages using Swadesh lists and big text corpora - PROPOR 2024 - Universidade de Santiago de Compostela\" \/>\n<meta name=\"twitter:description\" content=\"Automatic measurement of distances between languages using Swadesh lists and big text 
corpora General information Venue: Cidade da Cultura, Edif\u00edcio Font\u00e1n, Room 10, day 12, from 9.30h to 13h. For USC students, the registration to the tutorials is free (even the coffee break \ud83d\ude09). You can register by sending the name of the tutorial or [&hellip;]\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":\"WebPage\",\"@id\":\"https:\/\/propor2024.citius.gal\/index.php\/automatic-measurement-of-distances-between-languages-using-swadesh-lists-and-big-text-corpora\/\",\"url\":\"https:\/\/propor2024.citius.gal\/index.php\/automatic-measurement-of-distances-between-languages-using-swadesh-lists-and-big-text-corpora\/\",\"name\":\"Automatic measurement of distances between languages using Swadesh lists and big text corpora - PROPOR 2024 - Universidade de Santiago de Compostela\",\"isPartOf\":{\"@id\":\"https:\/\/propor2024.citius.gal\/#website\"},\"datePublished\":\"2024-02-06T19:22:02+00:00\",\"dateModified\":\"2024-03-08T09:44:13+00:00\",\"breadcrumb\":{\"@id\":\"https:\/\/propor2024.citius.gal\/index.php\/automatic-measurement-of-distances-between-languages-using-swadesh-lists-and-big-text-corpora\/#breadcrumb\"},\"inLanguage\":\"en-GB\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\/\/propor2024.citius.gal\/index.php\/automatic-measurement-of-distances-between-languages-using-swadesh-lists-and-big-text-corpora\/\"]}]},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\/\/propor2024.citius.gal\/index.php\/automatic-measurement-of-distances-between-languages-using-swadesh-lists-and-big-text-corpora\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\/\/propor2024.citius.gal\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"Automatic measurement of distances between languages using Swadesh lists and big text 
corpora\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\/\/propor2024.citius.gal\/#website\",\"url\":\"https:\/\/propor2024.citius.gal\/\",\"name\":\"PROPOR 2024 - Universidade de Santiago de Compostela\",\"description\":\"International Conference on Computational Processing of Portuguese\",\"publisher\":{\"@id\":\"https:\/\/propor2024.citius.gal\/#organization\"},\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\/\/propor2024.citius.gal\/?s={search_term_string}\"},\"query-input\":\"required name=search_term_string\"}],\"inLanguage\":\"en-GB\"},{\"@type\":\"Organization\",\"@id\":\"https:\/\/propor2024.citius.gal\/#organization\",\"name\":\"PROPOR 2024 - Universidade de Santiago de Compostela\",\"url\":\"https:\/\/propor2024.citius.gal\/\",\"logo\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-GB\",\"@id\":\"https:\/\/propor2024.citius.gal\/#\/schema\/logo\/image\/\",\"url\":\"http:\/\/172.16.240.236\/wp-content\/uploads\/2023\/03\/faviconPROPOR2024.png\",\"contentUrl\":\"http:\/\/172.16.240.236\/wp-content\/uploads\/2023\/03\/faviconPROPOR2024.png\",\"width\":200,\"height\":200,\"caption\":\"PROPOR 2024 - Universidade de Santiago de Compostela\"},\"image\":{\"@id\":\"https:\/\/propor2024.citius.gal\/#\/schema\/logo\/image\/\"}}]}<\/script>\n<!-- \/ Yoast SEO plugin. 
-->","yoast_head_json":{"title":"Automatic measurement of distances between languages using Swadesh lists and big text corpora - PROPOR 2024 - Universidade de Santiago de Compostela","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/propor2024.citius.gal\/index.php\/automatic-measurement-of-distances-between-languages-using-swadesh-lists-and-big-text-corpora\/","twitter_card":"summary_large_image","twitter_title":"Automatic measurement of distances between languages using Swadesh lists and big text corpora - PROPOR 2024 - Universidade de Santiago de Compostela","twitter_description":"Automatic measurement of distances between languages using Swadesh lists and big text corpora General information Venue: Cidade da Cultura, Edif\u00edcio Font\u00e1n, Room 10, day 12, from 9.30h to 13h. For USC students, the registration to the tutorials is free (even the coffee break \ud83d\ude09). 
You can register by sending the name of the tutorial or [&hellip;]","schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"WebPage","@id":"https:\/\/propor2024.citius.gal\/index.php\/automatic-measurement-of-distances-between-languages-using-swadesh-lists-and-big-text-corpora\/","url":"https:\/\/propor2024.citius.gal\/index.php\/automatic-measurement-of-distances-between-languages-using-swadesh-lists-and-big-text-corpora\/","name":"Automatic measurement of distances between languages using Swadesh lists and big text corpora - PROPOR 2024 - Universidade de Santiago de Compostela","isPartOf":{"@id":"https:\/\/propor2024.citius.gal\/#website"},"datePublished":"2024-02-06T19:22:02+00:00","dateModified":"2024-03-08T09:44:13+00:00","breadcrumb":{"@id":"https:\/\/propor2024.citius.gal\/index.php\/automatic-measurement-of-distances-between-languages-using-swadesh-lists-and-big-text-corpora\/#breadcrumb"},"inLanguage":"en-GB","potentialAction":[{"@type":"ReadAction","target":["https:\/\/propor2024.citius.gal\/index.php\/automatic-measurement-of-distances-between-languages-using-swadesh-lists-and-big-text-corpora\/"]}]},{"@type":"BreadcrumbList","@id":"https:\/\/propor2024.citius.gal\/index.php\/automatic-measurement-of-distances-between-languages-using-swadesh-lists-and-big-text-corpora\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/propor2024.citius.gal\/"},{"@type":"ListItem","position":2,"name":"Automatic measurement of distances between languages using Swadesh lists and big text corpora"}]},{"@type":"WebSite","@id":"https:\/\/propor2024.citius.gal\/#website","url":"https:\/\/propor2024.citius.gal\/","name":"PROPOR 2024 - Universidade de Santiago de Compostela","description":"International Conference on Computational Processing of 
Portuguese","publisher":{"@id":"https:\/\/propor2024.citius.gal\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/propor2024.citius.gal\/?s={search_term_string}"},"query-input":"required name=search_term_string"}],"inLanguage":"en-GB"},{"@type":"Organization","@id":"https:\/\/propor2024.citius.gal\/#organization","name":"PROPOR 2024 - Universidade de Santiago de Compostela","url":"https:\/\/propor2024.citius.gal\/","logo":{"@type":"ImageObject","inLanguage":"en-GB","@id":"https:\/\/propor2024.citius.gal\/#\/schema\/logo\/image\/","url":"http:\/\/172.16.240.236\/wp-content\/uploads\/2023\/03\/faviconPROPOR2024.png","contentUrl":"http:\/\/172.16.240.236\/wp-content\/uploads\/2023\/03\/faviconPROPOR2024.png","width":200,"height":200,"caption":"PROPOR 2024 - Universidade de Santiago de Compostela"},"image":{"@id":"https:\/\/propor2024.citius.gal\/#\/schema\/logo\/image\/"}}]}},"_links":{"self":[{"href":"https:\/\/propor2024.citius.gal\/index.php\/wp-json\/wp\/v2\/pages\/1241"}],"collection":[{"href":"https:\/\/propor2024.citius.gal\/index.php\/wp-json\/wp\/v2\/pages"}],"about":[{"href":"https:\/\/propor2024.citius.gal\/index.php\/wp-json\/wp\/v2\/types\/page"}],"author":[{"embeddable":true,"href":"https:\/\/propor2024.citius.gal\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/propor2024.citius.gal\/index.php\/wp-json\/wp\/v2\/comments?post=1241"}],"version-history":[{"count":11,"href":"https:\/\/propor2024.citius.gal\/index.php\/wp-json\/wp\/v2\/pages\/1241\/revisions"}],"predecessor-version":[{"id":1558,"href":"https:\/\/propor2024.citius.gal\/index.php\/wp-json\/wp\/v2\/pages\/1241\/revisions\/1558"}],"wp:attachment":[{"href":"https:\/\/propor2024.citius.gal\/index.php\/wp-json\/wp\/v2\/media?parent=1241"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}