Cradle of Civilization

A Blog about the Birth of Our Civilisation and Development

Indo-European languages tree by Levenshtein distance

Posted by Fredsvenn on October 10, 2014

Armenian language

In information theory and computer science, the Levenshtein distance is a string metric for measuring the difference between two sequences. Informally, the Levenshtein distance between two words is the minimum number of single-character edits (i.e. insertions, deletions or substitutions) required to change one word into the other.
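As an illustration of this definition, here is a minimal Python sketch of the standard dynamic-programming way to compute the distance. The function name `levenshtein` and the example words are chosen here for illustration and do not come from the original post.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions or
    substitutions needed to turn string a into string b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop

    # previous[j] holds the distance between the processed prefix of a
    # and the first j characters of b.
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # turning i characters into the empty string costs i
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,                      # delete char_a
                current[j - 1] + 1,                   # insert char_b
                previous[j - 1] + substitution_cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


print(levenshtein("kitten", "sitting"))  # 3
```

Only two rows of the table are kept at a time, so memory use grows with the length of the shorter word rather than with the product of both lengths.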

Levenshtein distance may also be referred to as edit distance, although that may also denote a larger family of distance metrics. It is closely related to pairwise string alignments.

It is named after Vladimir Levenshtein, a Russian scientist who has done research in information theory, error-correcting codes, and combinatorial design, and who considered this distance in 1965.

The evolution of languages closely resembles the evolution of haploid organisms. This similarity has recently been exploited to construct language trees. The key point is the definition of a distance between all pairs of languages that is analogous to a genetic distance.

Many methods have been proposed to define these distances. One of them, used in glottochronology, computes the distance from the percentage of shared “cognates”. Cognates are words inferred to have a common historical origin, and subjective judgment plays a significant role in identifying them.

Here we push the analogy with evolutionary biology further and introduce a genetic distance between language pairs by considering a renormalized Levenshtein distance between words with the same meaning and averaging over all the words contained in a Swadesh list. The subjectivity of the process is considerably reduced and reproducibility is greatly facilitated.
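The averaging described above can be sketched in Python as follows. The text does not say how the distance is renormalized, so dividing by the length of the longer word is an assumption made here. The helper names and the three-word lists are hypothetical toy data in ordinary spelling, not the Swadesh lists used in the study.

```python
def levenshtein(a: str, b: str) -> int:
    """Plain edit distance (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,
                               current[j - 1] + 1,
                               previous[j - 1] + cost))
        previous = current
    return previous[-1]


def normalized_distance(a: str, b: str) -> float:
    """Edit distance divided by the longer word's length.
    ASSUMPTION: this particular normalization is not stated in the text."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


def language_distance(words_a: dict, words_b: dict) -> float:
    """Average normalized distance over the meanings both lists contain."""
    shared = [meaning for meaning in words_a if meaning in words_b]
    if not shared:
        raise ValueError("the two word lists share no meanings")
    total = sum(normalized_distance(words_a[m], words_b[m]) for m in shared)
    return total / len(shared)


# Hypothetical toy lists keyed by meaning; a real run would use a full
# Swadesh list for each language.
english = {"water": "water", "night": "night", "mother": "mother"}
german = {"water": "wasser", "night": "nacht", "mother": "mutter"}
italian = {"water": "acqua", "night": "notte", "mother": "madre"}

print(language_distance(english, german))
print(language_distance(english, italian))
```

On these toy lists the English and German value comes out smaller than the English and Italian one. Computing the value for every pair of languages gives the distance matrix from which a tree can be built.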

We test our method on the Indo-European group, considering fifty different languages and the two hundred words of the Swadesh list for each of them. We obtain a tree that closely resembles a previously published one, with some significant differences.

Levenshtein distance

Linguistic distance
