CogNet

map-based illustration of cognate data

What is CogNet? It is a large-scale database of cognate pairs: it contains 8.1 million cognates in 338 languages, 38 writing systems, and 91285 concepts. It was automatically constructed from wordnets and dictionaries contained within the UKC resource, as described in our paper.

What are cognates? In short, cognates are words in different languages that share a common origin and the same meaning, such as the English letter and the French lettre. CogNet links its cognates to Princeton WordNet synsets, making the shared meanings explicit. Inside CogNet, two well-distinguished kinds of cognates are included:

  • words that have the same meaning and are etymologically related according to gold-standard evidence (such as the Etymological WordNet or Wiktionary): such cognates, which constitute about 40% of CogNet, can be used for applications in historical linguistics;
  • words that have the same meaning but we only have indirect evidence of their relatedness, such as orthographic or phonetic similarity. This corresponds to a more relaxed interpretation of cognacy that also includes, e.g., loanwords, and that is well suited for applications in computational linguistics such as machine translation or bilingual lexicon induction.

Why are cognates important? Cognates have been extensively studied in the fields of language typology and historical linguistics, as they are considered useful for researching the relatedness of languages. Cognates are also used in computational linguistics, e.g., for lexicon extension or to improve cross-lingual NLP tasks such as machine translation or bilingual lexicon induction.

How is CogNet licenced? Under CC-BY-SA-NC-4.0.

Where can I download CogNet? All versions of CogNet can be downloaded from the CogNet project GitHub page. See the table below for the contents of each version.

Version # Cognates # Words # Concepts Precision Kinds of evidence used
CogNet v2 8.14 M 1.07 M 91 k 95.62% etymological, phonetic, orthographic, geographic
CogNet v1 3.16 M 570 k 81 k 93.94% etymological, orthographic, geographic
CogNet v0 60 k 90 k 9 k 100% etymological only

You can also download WikTra, the Wiktionary-based transliteration tool used for building CogNet.

While CogNet is free to use, we ask you to cite the following paper if you use it or the WikTra transliteration tool in your research:
Khuyagbaatar Batsuren, Gábor Bella, and Fausto Giunchiglia. CogNet: A Large-Scale Cognate Database. Proceedings of ACL 2019, Florence, Italy.

@inproceedings{batsuren2019cognet,
  title={CogNet: A Large-Scale Cognate Database},
  author={Batsuren, Khuyagbaatar and Bella, Gabor and Giunchiglia, Fausto},
  booktitle={Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics},
  pages={3136--3145},
  year={2019}
}

How can I explore cognate data? Besides downloading the entire CogNet as a structured text file, you can also use the Linguarena website to display and browse (currently an older version of) cognate data interactively on a world map, as also shown in the figure above.

How is CogNet structured? Each row in the resource represents a cognate instance which is formed by the following individuals (columns are tab-separated):

ColumnDescription
concept A code used by Princeton WordNet 3.0 to represent a synset.
language 1the 3-letter iso code for the first language
word 1a word in the language 1
language 2the 3-letter iso code for the second language
word 2a word in the language 2
evidencedirect etymological or indirect algorithmic
transliteration 1a romanized word for the first word
tranlisteration 2a romanized word for the second word

Example

concept lang1 word1 lang2 word2 evidence transliteration1 transliteration2
n14996158 glg polipropileno jpn ポリプロピレン ETY NO_TRANSLIT poripuropiren
n06566077 nep सफ्टवेर kas سافٹویٚیَر ALG saphtawera saftoeyar

Acknowledgements