CogNet – UKC – Universal Knowledge Core

map-based illustration of cognate data

What is CogNet? It is a large-scale database of cognate pairs: it contains 8.1 million cognates in 338 languages, 38 writing systems, and 91285 concepts. It was automatically constructed from wordnets and dictionaries contained within the UKC resource, as described in our paper.

What are cognates? In short, cognates are words in different languages that share a common origin and the same meaning, such as the English letter and the French lettre. CogNet links its cognates to Princeton WordNet synsets, making the shared meanings explicit. Inside CogNet, two well-distinguished kinds of cognates are included:

words that have the same meaning and are etymologically related according to gold-standard evidence (such as the Etymological WordNet or Wiktionary): such cognates, which constitute about 40% of CogNet, can be used for applications in historical linguistics;
words that have the same meaning but we only have indirect evidence of their relatedness, such as orthographic or phonetic similarity. This corresponds to a more relaxed interpretation of cognacy that also includes, e.g., loanwords, and that is well suited for applications in computational linguistics such as machine translation or bilingual lexicon induction.

Why are cognates important? Cognates have been extensively studied in the fields of language typology and historical linguistics, as they are considered useful for researching the relatedness of languages. Cognates are also used in computational linguistics, e.g., for lexicon extension or to improve cross-lingual NLP tasks such as machine translation or bilingual lexicon induction.

How is CogNet licenced? Under CC-BY-SA-NC-4.0.

Where can I download CogNet? All versions of CogNet can be downloaded from the CogNet project GitHub page. See the table below for the contents of each version.

Version	# Cognates	# Words	# Concepts	Precision	Kinds of evidence used
CogNet v2	8.14 M	1.07 M	91 k	95.62%	etymological, phonetic, orthographic, geographic
CogNet v1	3.16 M	570 k	81 k	93.94%	etymological, orthographic, geographic
CogNet v0	60 k	90 k	9 k	100%	etymological only

You can also download WikTra, the Wiktionary-based transliteration tool used for building CogNet.

While CogNet is free to use, we ask you to cite the following paper if you use it or the WikTra transliteration tool in your research:

Khuyagbaatar Batsuren, Gábor Bella, and Fausto Giunchiglia. CogNet: A Large-Scale Cognate Database. Proceedings of ACL 2019, Florence, Italy.

@inproceedings{batsuren2019cognet,
  title={CogNet: A Large-Scale Cognate Database},
  author={Batsuren, Khuyagbaatar and Bella, Gabor and Giunchiglia, Fausto},
  booktitle={Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics},
  pages={3136--3145},
  year={2019}
}

How can I explore cognate data? Besides downloading the entire CogNet as a structured text file, you can also use the Linguarena website to display and browse (currently an older version of) cognate data interactively on a world map, as also shown in the figure above.

How is CogNet structured? Each row in the resource represents a cognate instance which is formed by the following individuals (columns are tab-separated):

Column	Description
concept	A code used by Princeton WordNet 3.0 to represent a synset.
language 1	the 3-letter iso code for the first language
word 1	a word in the language 1
language 2	the 3-letter iso code for the second language
word 2	a word in the language 2
evidence	direct etymological or indirect algorithmic
transliteration 1	a romanized word for the first word
tranlisteration 2	a romanized word for the second word

Example

concept	lang1	word1	lang2	word2	evidence	transliteration1	transliteration2
n14996158	glg	polipropileno	jpn	ポリプロピレン	ETY	NO_TRANSLIT	poripuropiren
n06566077	nep	सफ्टवेर	kas	سافٹویٚیَر	ALG	saphtawera	saftoeyar

Acknowledgements